addition, like all Hadoop processing, it relies on map-reduce jobs that can be
run in parallel on separate chunks of data and combined after the analysis
to arrive at a result. For example, Pig is well suited to scanning massive
amounts of measurement data such as temperatures, grouping the readings by day,
and reducing each group to the maximum temperature for that day. Another factor to keep in mind is
the latency involved in the batch processing of the data. This means that Pig
processing is suitable for post-processing of the data as opposed to real-time
processing that occurs as the data is collected.
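The temperature example might look like the following in Pig Latin. This is only a minimal sketch: the file name temperatures.csv, its two-column comma-delimited layout, and the relation and field names are all assumptions made for illustration.

```pig
-- Assumed input: a comma-delimited text file with one reading per line,
-- for example: 2014-01-15,41.2  (hypothetical file name and layout)
readings = LOAD 'temperatures.csv' USING PigStorage(',')
           AS (day:chararray, temperature:double);

-- Collect all readings that share the same day into one group.
by_day = GROUP readings BY day;

-- Reduce each group to a single row holding that day's maximum.
max_by_day = FOREACH by_day GENERATE
                 group AS day,
                 MAX(readings.temperature) AS max_temperature;

-- Print the result to the screen.
DUMP max_by_day;
```

Because each day's group can be reduced independently of the others, this kind of script fits the parallel, batch-oriented model described above.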
You can run Pig either interactively or in batch mode. Typically interactive
mode is used during development. When you run Pig interactively, you can
easily see the results of the scripts dumped out to the screen. This is a great
way to build up and debug a multi-step ETL process. Once the script is built,
you can save it to a text file and run it in batch mode using a scheduler or as
part of a workflow. This generally occurs in production, where scripts
are run unattended during off-peak hours. The results can be dumped into a
file that you can use for further analysis or as an input file for tools, such as
PowerPivot, Power View, and Power Map. (You will see how these tools are
used in Chapter 11, “Visualizing Big Data with Microsoft BI.”)
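As a rough sketch of the batch pattern, the script below ends with STORE rather than DUMP, so the results land in a file instead of on the screen. It reuses the daily maximum temperature idea mentioned earlier. The script name, input file, field names, and output directory are hypothetical, and the invocations in the comments assume the standard pig command-line client.

```pig
-- daily_max.pig (hypothetical script name)
-- Interactive development: start the Grunt shell with `pig` (or `pig -x local`
-- to work against the local file system), enter statements one at a time,
-- and use DUMP to inspect a relation.
-- Batch: run the saved file unattended with `pig daily_max.pig`.

readings = LOAD 'temperatures.csv' USING PigStorage(',')
           AS (day:chararray, temperature:double);
by_day = GROUP readings BY day;
max_by_day = FOREACH by_day GENERATE
                 group AS day,
                 MAX(readings.temperature) AS max_temperature;

-- Write comma-delimited output that another tool can pick up.
-- Pig writes a directory of part files and fails if the directory already exists.
STORE max_by_day INTO 'output/daily_max' USING PigStorage(',');
```

A scheduled job would typically remove or rename the previous output directory before each run, because STORE does not overwrite existing output.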
Taking Advantage of Built-in Functions
As you saw in Chapter 8, “Effective Big Data ETL with SSIS, Pig, and
SQOOP,” Pig scripts are written in a scripting language called Pig Latin.
Although it is a lot easier to write the ETL processing using Pig Latin than it
is to write the low-level map-reduce jobs, at some point the Pig Latin has to
be converted into a map-reduce job that does the actual processing. This is
where functions come into the picture. In Pig, functions process the data and
are written in Java. Pig comes with a set of built-in functions to implement
common processing tasks such as the following:
• Loading and storing data
• Evaluating and aggregating data
• Executing common math functions
• Implementing string functions
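A minimal sketch touching each of these categories is shown here. The functions used (PigStorage, COUNT, SUM, AVG, ROUND, UPPER, and TRIM) are Pig built-ins, but the file names, schema, and relation names are assumptions chosen only for illustration.

```pig
-- Load function: PigStorage parses delimited text (hypothetical file and schema).
sales = LOAD 'sales.txt' USING PigStorage('\t')
        AS (region:chararray, product:chararray, amount:double);

-- String functions: trim whitespace and normalize case of the region name.
cleaned = FOREACH sales GENERATE
              UPPER(TRIM(region)) AS region,
              product,
              amount;

-- Aggregate functions operate on the bag that GROUP produces for each key.
by_region = GROUP cleaned BY region;
summary = FOREACH by_region GENERATE
              group AS region,
              COUNT(cleaned) AS order_count,
              SUM(cleaned.amount) AS total_amount,
              ROUND(AVG(cleaned.amount)) AS avg_amount_rounded;  -- math function

-- Store function: PigStorage also writes delimited text.
STORE summary INTO 'output/sales_summary' USING PigStorage(',');
```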
For example, the default load function PigStorage is used to load data
from structured text files in UTF-8 format. The following code loads a file
containing flight delay data into a relation (table) named FlightData: