9
Building Data Transformation Workflows with Pig and Cascading
Collecting and processing large amounts of data can be a complicated task. Fortunately, many common data processing challenges can be broken down into smaller problems. Open-source software tools allow us to shard and distribute data transformation jobs across many machines, using strategies such as MapReduce.
Although frameworks like Hadoop help manage much of the complexity of taking large MapReduce processing tasks and farming them out to individual machines in a cluster, we still need to define exactly how the data will be processed. Do we want to alter the data in some way? Should we split it up or combine it with another source? With large amounts of data coming from many different sources, chaining together multiple data processing tasks into a complex pipeline directly using MapReduce functions or streaming API scripts can quickly get out of hand. Sometimes a single conceptual operation might require several MapReduce steps, resulting in hard-to-manage code. It's far more practical to abstract the problem even further, by defining workflows that in turn dictate the underlying MapReduce operations.
Imagine how hard it would be to explain how your MapReduce job works to the average person on the street. Instead of defining how your mapper, reducer, and combiner steps work, you would probably just say, "Well, I took this big collection of data, joined each record with some data from somewhere else, and saved the results in a new set of files."
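In Pig Latin, for example, that plain-English description maps almost directly onto a handful of statements. The following is only a minimal sketch; the file names, field names, and comma-delimited format are hypothetical, not taken from the chapter's examples.

-- Hypothetical inputs: events.csv (user_id, action) and users.csv (user_id, country)
events = LOAD 'events.csv' USING PigStorage(',') AS (user_id:chararray, action:chararray);
users  = LOAD 'users.csv' USING PigStorage(',') AS (user_id:chararray, country:chararray);

-- Join each event record with its matching user record
joined = JOIN events BY user_id, users BY user_id;

-- Save the combined records in a new set of files
STORE joined INTO 'joined_output' USING PigStorage(',');

Behind the scenes, Pig compiles statements like these into one or more MapReduce jobs.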
This chapter is all about using tools that help us work at a level of human-friendly
abstraction. We will take a look at two very popular but very different open-source
tools for managing the complexity of multistep data transformation pipelines: Pig and
Cascading.