The main difference between Hive and the other languages previously
discussed comes from the fact that Hive's design is more influenced by
classic relational warehousing systems, which is evident both at the data
model and at the query language level. Hive thinks of its data in relational
terms—data sources are stored in tables, consisting of a fixed number
of rows with predefined data types. Similar to Pig and Jaql, Hive's data
model provides support for semistructured and nested data in the form of
complex data types like associative arrays (maps), lists, and structs, which
facilitates the use of denormalized inputs. On the other hand, Hive
differs from the other higher-level languages for Hadoop in that it
uses a catalog to hold metadata about its input sources. This means that
the table schema must be declared and the data loaded before any queries
involving the table are submitted to the system (which mirrors the
standard RDBMS process). The schema definition language extends the classic
DDL CREATE TABLE syntax. Currently, Hive does not provide support
for updates, which means that any data load statement will enforce the
removal of any old data in the specified target table or partition. The
standard way to append data to an existing table in Hive is to create a new
partition for each append set. Since appends in an OLAP environment are
typically performed periodically in a batch manner, this strategy is a good
fit for most real-world scenarios.
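As a sketch, a partitioned Hive table using the complex types described above might be declared and loaded as follows (the table, columns, and paths are illustrative, not taken from the text):

```sql
-- Hypothetical page-view table: complex types capture denormalized input,
-- and a date partition column supports periodic batch appends.
CREATE TABLE page_views (
    user_id     BIGINT,
    url         STRING,
    referrers   ARRAY<STRING>,                  -- list type
    properties  MAP<STRING, STRING>,            -- associative array (map)
    geo         STRUCT<country:STRING, city:STRING>
)
PARTITIONED BY (dt STRING);

-- A load with OVERWRITE removes any old data in the target partition
-- rather than updating rows in place; each batch append simply
-- targets a fresh partition.
LOAD DATA INPATH '/staging/2012-03-01'
OVERWRITE INTO TABLE page_views PARTITION (dt = '2012-03-01');
```

Because each append lands in its own partition, queries that filter on the partition column also read only the relevant batches.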
The Hive Query Language (HiveQL) is an SQL dialect with various syntax
extensions. HiveQL supports many traditional SQL features, such as FROM-clause
subqueries, various join types, grouping and aggregation, and a range of
useful built-in data processing functions, which makes Hive queries
intuitive to write for anyone familiar with SQL basics. In addition,
HiveQL provides native support for in-line MapReduce job specification. The
semantics of the mapper and the reducer are specified in external scripts,
which communicate with the parent Hadoop task through the standard
input and output streams (similar to the streaming API for user-defined
functions (UDFs) in Pig).
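These features can be sketched with two queries against a hypothetical page_views table (all table, column, and script names here are illustrative):

```sql
-- FROM-clause subquery, a join, grouping, and built-in aggregates.
SELECT u.country, count(*) AS urls, max(pv.hits) AS peak_hits
FROM (SELECT user_id, url, count(*) AS hits
      FROM page_views
      GROUP BY user_id, url) pv
JOIN users u ON (pv.user_id = u.id)
GROUP BY u.country;

-- In-line MapReduce: the mapper's semantics live in an external script
-- that reads tab-separated rows on stdin and writes rows to stdout,
-- much like the streaming API for Pig UDFs.
FROM page_views
SELECT TRANSFORM (user_id, url)
USING 'python normalize_urls.py'
AS (user_id, normalized_url);
```

In the second query, Hive pipes the selected columns through the (hypothetical) normalize_urls.py script and treats each line it emits as a row of the declared output columns.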
17.3.4 Pig
As data volumes and processing complexity grow, analyzing large data
sets involves dataflows that become increasingly hard to express
as raw MapReduce programs. There was a need for an abstraction layer over
MapReduce: a high-level language that is more user friendly, is SQL-like in
terms of expressing dataflows, has the flexibility to manage multistep data
transformations, and handles joins simply, with straightforward program flow.
Apache Pig was the first system to provide a higher-level language on top of
Hadoop. Pig started as an internal research project at Yahoo (one of the early
adopters of Hadoop) but, owing to its popularity, was subsequently promoted
to a production-level system and adopted as an open-source project by the
Apache Software Foundation.