Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management

FROM (

MAP doctext USING 'python wc_mapper.py' AS (word, cnt)

FROM docs

CLUSTER BY word

) a

REDUCE word, cnt USING 'python wc_reduce.py';

FIGURE 2.13

An example HiveQL query. (From A. Thusoo et al., PVLDB, 2(2), 1626-1629, 2009.)

Thus, it supports all the major primitive types (e.g., integers, floats, strings) as well

as complex types (e.g., maps, lists, structs). Hive supports queries expressed in an

SQL-like declarative language, HiveQL,* and therefore can be easily understood by

anyone who is familiar with SQL. These queries are compiled into MapReduce jobs

that are executed using Hadoop. In addition, HiveQL enables users to plug custom

MapReduce scripts into queries [125]. For example, the canonical MapReduce word

count example on a table of documents (Figure 2.1) can be expressed in HiveQL as

depicted in Figure 2.13, where the MAP clause indicates how the input columns

(doctext) can be transformed using a user program ('python wc_mapper.py') into output

columns (word and cnt). The REDUCE clause specifies the user program to invoke

('python wc_reduce.py') on the output columns of the subquery.
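
The source does not show the contents of wc_mapper.py or wc_reduce.py, so the following is only a minimal sketch of what such scripts might do. It assumes the usual streaming convention, in which Hive passes each row to the script as a tab-separated line on standard input and reads tab-separated lines back from standard output. The function names map_lines and reduce_lines and the sample rows are hypothetical, and the sort step merely stands in for CLUSTER BY word.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper sketch: emit one 'word<TAB>1' record per word in each input row."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer sketch: sum the counts for each word.

    Assumes records for the same word arrive consecutively, which is
    what CLUSTER BY word arranges in the query of Figure 2.13.
    """
    records = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(records, key=lambda rec: rec[0]):
        yield f"{word}\t{sum(int(cnt) for _, cnt in group)}"


# Local simulation with made-up rows; in Hive the rows would come from docs.
docs = ["the quick fox", "the lazy dog"]
mapped = sorted(map_lines(docs))  # sorting stands in for CLUSTER BY word
counts = list(reduce_lines(mapped))
```

In an actual deployment, each function would sit in its own script that loops over sys.stdin and prints its results, since the query names two separate programs.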

HiveQL supports Data Definition Language (DDL) statements, which can be

used to create, drop, and alter tables in a database [124]. It allows users to load

data from external sources and insert query results into Hive tables via the load

and insert Data Manipulation Language (DML) statements, respectively. However,

HiveQL currently does not support updating or deleting rows in existing tables

(in particular, INSERT INTO, UPDATE, and DELETE statements). This restriction allows

the use of very simple mechanisms to deal with concurrent read and write operations

without implementing complex locking protocols. The metastore component

is Hive's system catalog, which stores metadata about the underlying tables. This

metadata is specified during table creation and reused every time the table is referenced

in HiveQL. The metastore distinguishes Hive as a traditional warehousing

solution when compared with similar data-processing systems that are built on top

of MapReduce-like architectures, such as Pig Latin [109].

2.4.4 Tenzing

The Tenzing system [33] has been presented by Google as an SQL query execution

engine that is built on top of MapReduce and provides a comprehensive

SQL92 implementation with some SQL99 extensions (e.g., ROLLUP() and CUBE()

OLAP extensions). Tenzing also supports querying data in different formats such

as row stores (e.g., MySQL database), column stores, Bigtable (Google's built-in

* http://wiki.apache.org/hadoop/Hive/LanguageManual.
