Advanced Analytics—Technology and Tools: MapReduce and Hadoop - Data Science and Big Data Analytics

Database Reference

In-Depth Information

Other functions and the details of these built-in functions can be found at the

pig.apache.org website [25].

In terms of extensibility, Pig allows the execution of user-defined functions (UDFs)

in its environment. Thus, some complex operations can be coded in the user's

language of choice and executed in the Pig environment. Users can share their

UDFs in a repository called the Piggybank hosted on the Apache site [26]. Over

time, the most useful UDFs may be included as built-in functions in Pig.

10.2.2 Hive

Similar to Pig, Apache Hive enables users to process data without explicitly writing

MapReduce code. One key difference to Pig is that the Hive language, HiveQL

(Hive Query Language), resembles Structured Query Language (SQL) rather than

a scripting language.

A Hive table structure consists of rows and columns. The rows typically correspond

to some record, transaction, or particular entity (for example, customer) detail.

The values of the corresponding columns represent the various attributes or

characteristics for each row. Hadoop and its ecosystem are used to apply some

structure to unstructured data. Therefore, if a table structure is an appropriate way

to view the restructured data, Hive may be a good tool to use.

Additionally, a user may consider using Hive if the user has experience with SQL

and the data is already in HDFS. Another consideration in using Hive may be how

data will be updated or added to the Hive tables. If data will simply be added to a

table periodically, Hive works well, but if there is a need to update data in place, it

may be beneficial to consider another tool, such as HBase, which will be discussed

in the next section.

Although Hive's performance may be better in certain applications than a

conventional SQL database, Hive is not intended for real-time querying. A Hive

query is first translated into a MapReduce job, which is then submitted to the

Hadoop cluster. Thus, the execution of the query has to compete for resources with

any other submitted job. Like Pig, Hive is intended for batch processing. Again,

HBase may be a better choice for real-time query needs.

To summarize the preceding discussion, consider using Hive when the following

conditions exist:

• Data easily fits into a table structure.

Search WWH ::

Custom Search

Home