Notice that all of the unwanted characters have now been removed from the output. Given this simple example,
you will be able to build your own UDF extensions to Pig. You are now able to use Apache Pig to code your Map
Reduce jobs and to expand its functionality via UDFs. You can also gain greater flexibility and reduced code
volume by using your own UDF libraries.
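As a reminder of the pattern, a UDF jar is registered in a Pig Latin script and then invoked like any built-in function. The jar path, package, and function names below are hypothetical placeholders, not the ones from the earlier example:

```pig
-- Hypothetical jar and class names; substitute your own UDF build.
REGISTER /home/hadoop/pig/myudfs.jar;
DEFINE CLEANTEXT com.example.pig.CleanText();

-- Load raw lines, then apply the UDF to each one.
raw   = LOAD '/user/hadoop/data/input.txt' AS (line:chararray);
clean = FOREACH raw GENERATE CLEANTEXT(line);
DUMP clean;
```

The DEFINE statement simply gives the Java class a short alias so that the rest of the script reads like ordinary Pig Latin.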
In the next section, you will tackle the same Map Reduce job using Apache Hive, the big-data data warehouse.
A similar word-count algorithm will be presented using HiveQL, Hive's SQL-like query language. All of these methods
and scripts for creating Map Reduce jobs are presented to give you a sample of each approach. The data that you want
to use and the data architecture that you choose will govern which route you take to create your jobs. The ETL tools
that you choose will also affect your approach—for instance, Talend and Pentaho, to be discussed in Chapter 10, can
integrate well with Pig functionality.
■ Note For more information on Apache Pig and Pig Latin, see the Apache Software Foundation guide at
http://pig.apache.org/docs/r0.12.1/start.html.
Map Reduce with Hive
This next example involves installing Apache Hive from hive.apache.org. Hive is a data warehouse system that uses
Hadoop for storage. It is possible to interrogate data on HDFS by using an SQL-like language called HiveQL. Hive can
represent HDFS-based data via external tables (described in later chapters) or as relational data, with relationships
between data in different Hive tables.
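To illustrate what an external table looks like, here is a minimal HiveQL sketch; the table name, column, and HDFS path are hypothetical, not the actual example used later in the chapter:

```sql
-- Hypothetical external table over raw text files already on HDFS.
-- Dropping an EXTERNAL table removes only the metadata; the HDFS data remains.
CREATE EXTERNAL TABLE rawdata (line STRING)
ROW FORMAT DELIMITED
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hadoop/rawdata';
```

Because the data stays in place on HDFS, external tables are a convenient way to layer a SQL view over files produced by other Hadoop jobs.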
In this section, I will explain how to source and install Hive, followed by a simple word-count job on the same
data as used previously.
Installing Hive
When downloading and installing Hive, be sure to choose the version compatible with the version of Hadoop you
are using. For this section's examples, I chose version 0.13.1, which is compatible with the Hadoop version 1.2.1
used earlier. As before, I used wget from the Linux command line to download the tarred and gzipped release from
a suggested mirror site:
[hadoop@hc1nn Downloads]$ wget http://apache.mirror.quintex.com/hive/hive-0.13.1/apache-hive-0.13.1-
bin.tar.gz
[hadoop@hc1nn Downloads]$ ls -l apache-hive-0.13.1-bin.tar.gz
-rw-rw-r--. 1 hadoop hadoop 54246778 Jun 3 07:31 apache-hive-0.13.1-bin.tar.gz
As before, you unpack the software using the Linux commands gunzip and tar :
[hadoop@hc1nn Downloads]$ gunzip apache-hive-0.13.1-bin.tar.gz
[hadoop@hc1nn Downloads]$ tar xvf apache-hive-0.13.1-bin.tar
[hadoop@hc1nn Downloads]$ ls -ld apache-hive-0.13.1-bin
drwxrwxr-x. 8 hadoop hadoop 4096 Jun 18 17:03 apache-hive-0.13.1-bin
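Before running Hive, the shell needs to know where it lives. A minimal sketch, assuming the archive was unpacked under the hadoop user's Downloads directory (adjust the path to match your own layout):

```shell
# Hypothetical install path; point HIVE_HOME at your unpacked directory.
export HIVE_HOME=/home/hadoop/Downloads/apache-hive-0.13.1-bin
export PATH=$PATH:$HIVE_HOME/bin

# Confirm the variable is set before launching the Hive shell.
echo "HIVE_HOME is $HIVE_HOME"
```

Adding these two lines to the hadoop user's ~/.bashrc makes the setting persistent across logins.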