The main difference between Hive and the other languages previously
discussed comes from the fact that Hive's design is more influenced by
classic relational warehousing systems, which is evident both at the data
model and at the query language level. Hive thinks of its data in relational
terms—data sources are stored in tables, consisting of a fixed number
of rows with predefined data types. Similar to Pig and Jaql, Hive's data
model provides support for semistructured and nested data in the form of
complex data types like associative arrays (maps), lists, and structs, which
facilitates the use of denormalized inputs. On the other hand, Hive
differs from the other higher-level languages for Hadoop in that it
uses a catalog to hold metadata about its input sources. This means that
the table schema must be declared and the data loaded before any queries
involving the table are submitted to the system (which mirrors the
standard RDBMS process). The schema definition language extends the classic
DDL CREATE TABLE syntax. Currently, Hive does not provide support
for updates, which means that any data load statement will enforce the
removal of any old data in the specified target table or partition. The
standard way to append data to an existing table in Hive is to create a new
partition for each append set. Since appends in an OLAP environment are
typically performed periodically in a batch manner, this strategy is a good
fit for most real-world scenarios.
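As a sketch, a partitioned Hive table using the complex types described above might be declared and loaded as follows (the table, columns, and paths are illustrative, not taken from the text):

```sql
-- Hypothetical page-view table: complex types capture denormalized input,
-- and a date partition column supports periodic batch appends.
CREATE TABLE page_views (
    user_id     BIGINT,
    url         STRING,
    referrers   ARRAY<STRING>,                  -- list type
    properties  MAP<STRING, STRING>,            -- associative array (map)
    geo         STRUCT<country:STRING, city:STRING>
)
PARTITIONED BY (dt STRING);

-- A load with OVERWRITE removes any old data in the target partition
-- rather than updating rows in place; each batch append simply
-- targets a fresh partition.
LOAD DATA INPATH '/staging/2012-03-01'
OVERWRITE INTO TABLE page_views PARTITION (dt = '2012-03-01');
```

Because each append lands in its own partition, queries that filter on the partition column also read only the relevant batches.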
The Hive Query Language (HiveQL) is an SQL dialect with various syntax
extensions. HiveQL supports many traditional SQL features, such as FROM-clause
subqueries, various join types, grouping and aggregation, and a range of
useful built-in data processing functions, which makes Hive queries
intuitive to write for anyone familiar with SQL basics. In addition,
HiveQL provides native support for in-line MapReduce job specification. The
semantics of the mapper and the reducer are specified in external scripts,
which communicate with the parent Hadoop task through the standard
input and output streams (similar to the streaming API for user-defined
functions (UDFs) in Pig).
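These features can be sketched with two queries against a hypothetical page_views table (all table, column, and script names here are illustrative):

```sql
-- FROM-clause subquery, a join, grouping, and built-in aggregates.
SELECT u.country, count(*) AS urls, max(pv.hits) AS peak_hits
FROM (SELECT user_id, url, count(*) AS hits
      FROM page_views
      GROUP BY user_id, url) pv
JOIN users u ON (pv.user_id = u.id)
GROUP BY u.country;

-- In-line MapReduce: the mapper's semantics live in an external script
-- that reads tab-separated rows on stdin and writes rows to stdout,
-- much like the streaming API for Pig UDFs.
FROM page_views
SELECT TRANSFORM (user_id, url)
USING 'python normalize_urls.py'
AS (user_id, normalized_url);
```

In the second query, Hive pipes the selected columns through the (hypothetical) normalize_urls.py script and treats each line it emits as a row of the declared output columns.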
17.3.4 Pig
As data volumes and processing complexity grow, analyzing large data
sets involves dataflows that become increasingly hard to express
as raw MapReduce programs. There was a need for an abstraction layer over
MapReduce: a high-level language that is more user friendly, is SQL-like in
terms of expressing dataflows, has the flexibility to manage multistep data
transformations, and handles joins simply, with straightforward program flow.
Apache Pig was the first system to provide a higher-level language on top of
Hadoop. Pig started as an internal research project at Yahoo (one of the early
adopters of Hadoop) but, owing to its popularity, was subsequently promoted
to a production-level system and adopted as an open-source project by the
Apache Software Foundation.