9
Building Data Transformation Workflows with Pig and Cascading
Collecting and processing large amounts of data can be a complicated task. Fortunately, many common data processing challenges can be broken down into smaller problems. Open-source software tools allow us to shard and distribute data transformation jobs across many machines, using strategies such as MapReduce.
Although frameworks like Hadoop help manage much of the complexity of taking large MapReduce processing tasks and farming them out to individual machines in a cluster, we still need to define exactly how the data will be processed. Do we want to alter the data in some way? Should we split it up or combine it with another source? With large amounts of data coming from many different sources, chaining together multiple data processing tasks into a complex pipeline directly using MapReduce functions or streaming API scripts can quickly get out of hand. Sometimes a single conceptual operation might require several MapReduce steps, resulting in hard-to-manage code. It's far more practical to abstract the problem even further, by defining workflows that in turn dictate the underlying MapReduce operations.
Imagine how hard it would be to explain how your MapReduce job works to the average person on the street. Instead of defining how your mapper, reducer, and combiner steps work, you would probably just say, "Well, I took this big collection of data, joined each record with some data from somewhere else, and saved the results in a new set of files."
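In Pig Latin, for example, that plain-English description maps almost directly onto a handful of statements. The following is only a minimal sketch; the file names, field names, and comma-delimited format are hypothetical, not taken from the chapter's examples.

-- Hypothetical inputs: events.csv (user_id, action) and users.csv (user_id, country)
events = LOAD 'events.csv' USING PigStorage(',') AS (user_id:chararray, action:chararray);
users  = LOAD 'users.csv' USING PigStorage(',') AS (user_id:chararray, country:chararray);

-- Join each event record with its matching user record
joined = JOIN events BY user_id, users BY user_id;

-- Save the combined records in a new set of files
STORE joined INTO 'joined_output' USING PigStorage(',');

Behind the scenes, Pig compiles statements like these into one or more MapReduce jobs.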
This chapter is all about using tools that help us work at a level of human-friendly
abstraction. We will take a look at two very popular but very different open-source
tools for managing the complexity of multistep data transformation pipelines: Pig and
Cascading.