Database Reference
In-Depth Information
Other functions and the details of these built-in functions can be found at the
pig.apache.org website [25].
In terms of extensibility, Pig allows the execution of user-defined functions (UDFs)
in its environment. Thus, some complex operations can be coded in the user's
language of choice and executed in the Pig environment. Users can share their
UDFs in a repository called the Piggybank hosted on the Apache site [26]. Over
time, the most useful UDFs may be included as built-in functions in Pig.
10.2.2 Hive
Similar to Pig, Apache Hive enables users to process data without explicitly writing
MapReduce code. One key difference to Pig is that the Hive language, HiveQL
(Hive Query Language), resembles Structured Query Language (SQL) rather than
a scripting language.
A Hive table structure consists of rows and columns. The rows typically correspond
to some record, transaction, or particular entity (for example, customer) detail.
The values of the corresponding columns represent the various attributes or
characteristics for each row. Hadoop and its ecosystem are used to apply some
structure to unstructured data. Therefore, if a table structure is an appropriate way
to view the restructured data, Hive may be a good tool to use.
Additionally, a user may consider using Hive if the user has experience with SQL
and the data is already in HDFS. Another consideration in using Hive may be how
data will be updated or added to the Hive tables. If data will simply be added to a
table periodically, Hive works well, but if there is a need to update data in place, it
may be beneficial to consider another tool, such as HBase, which will be discussed
in the next section.
Although Hive's performance may be better in certain applications than a
conventional SQL database, Hive is not intended for real-time querying. A Hive
query is first translated into a MapReduce job, which is then submitted to the
Hadoop cluster. Thus, the execution of the query has to compete for resources with
any other submitted job. Like Pig, Hive is intended for batch processing. Again,
HBase may be a better choice for real-time query needs.
To summarize the preceding discussion, consider using Hive when the following
conditions exist:
• Data easily fits into a table structure.
Search WWH ::




Custom Search