FROM (
MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
FROM docs
CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
Fig. 9.13 An example HiveQL query
unstructured world of Hadoop while still maintaining the extensibility and flexibility
that Hadoop provides. Thus, it supports all the major primitive types (e.g. integers,
floats, strings) as well as complex types (e.g. maps, lists, structs). Hive supports
queries expressed in an SQL-like declarative language, HiveQL [29], and therefore
can be easily understood by anyone who is familiar with SQL. These queries
are compiled into MapReduce jobs that are executed using Hadoop. In addition,
HiveQL enables users to plug custom MapReduce scripts into queries [224]. For
example, the canonical MapReduce word count example on a table of documents
(Fig. 9.1) can be expressed in HiveQL as depicted in Fig. 9.13, where the MAP clause
indicates how the input columns (doctext) can be transformed using a user program
('python wc_mapper.py') into output columns (word and cnt). The REDUCE clause
specifies the user program to invoke ('python wc_reduce.py') on the output columns
of the subquery.
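The source does not show the contents of wc_mapper.py and wc_reduce.py, so the following is only a minimal sketch of the logic such scripts might contain. It assumes the mapper emits one (word, 1) pair per whitespace-separated token and that the reducer receives pairs already grouped by word, as CLUSTER BY word would arrange. The function names map_words, reduce_counts and word_count are hypothetical, and the local sort merely stands in for the clustering step. In an actual Hive deployment, such scripts would typically read tab-separated rows from standard input and write tab-separated rows to standard output instead of passing Python objects.

```python
from itertools import groupby


def map_words(lines):
    """Mapper logic (assumed): yield a (word, 1) pair for every token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs):
    """Reducer logic (assumed): sum the counts of consecutive equal words.

    This relies on the input being grouped by word, which is what
    CLUSTER BY word is meant to guarantee for each reducer.
    """
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(cnt for _, cnt in group)


def word_count(docs):
    """Simulate the whole query locally on an iterable of document texts."""
    # Sorting stands in for CLUSTER BY word in this single-process sketch.
    mapped = sorted(map_words(docs))
    return dict(reduce_counts(mapped))
```

For instance, calling word_count on the two documents "a b a" and "b c" goes through the same map, cluster and reduce stages as the query in Fig. 9.13, only within a single process.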
HiveQL supports Data Definition Language (DDL) statements, which can be
used to create, drop and alter tables in a database [223]. It allows users to load
data from external sources and insert query results into Hive tables via the load
and insert Data Manipulation Language (DML) statements, respectively. However,
HiveQL currently does not support updating or deleting rows in existing
tables (in particular, INSERT INTO, UPDATE and DELETE statements are not
supported). This restriction allows the use of very simple mechanisms to deal with
concurrent read and write operations without implementing complex locking
protocols. The metastore component is Hive's system catalog, which stores metadata
about the underlying tables. This metadata is specified during table creation and
reused every time the table is referenced in HiveQL. The metastore distinguishes
Hive as a traditional warehousing solution when compared with similar data
processing systems built on top of MapReduce-like architectures, such as Pig Latin [188].
Tenzing
The Tenzing system [100] has been presented by Google as an SQL query execution
engine which is built on top of MapReduce and provides a comprehensive SQL92
implementation with some SQL99 extensions (e.g. ROLLUP() and CUBE() OLAP
extensions). Tenzing also supports querying data in different formats such as: row
 