Tasks that require very little work to distribute across a number of machines are
known as embarrassingly parallel problems—probably because this type of problem
provides you with an embarrassment of data riches. Or it might mean that you tell
your boss, “I'm embarrassed to say this, but all I really had to do to earn my paycheck
was to process your 200GB dataset by writing a 20-line Python script.”
The overall concept behind MapReduce is fairly simple, but the logistics of coordinating all the shards, shuffling, and subprocesses can become very complex, especially when data sizes get large. First of all, imagine having to build a system that keeps track of every shard of data being processed and which machine is handling it. Likewise, coordinating separate machines as they pass messages to one another is daunting. The shuffle phase of MapReduce can be fairly tricky as well; when you are dealing with millions of independently processed shards of data, efficiently sorting them all into their respective reducer steps can take quite a long time. Because of these challenges, building your own custom MapReduce framework is rarely practical. Instead of reinventing the wheel, use an available framework.
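To make those phases concrete before bringing Hadoop into the picture, here is a toy, purely in-memory sketch of a word count expressed as separate map, shuffle, and reduce steps. The function names and sample documents are invented for illustration, and a real framework would run each phase across many machines while handling the sorting, message passing, and failure recovery described above.

    # A toy, in-memory illustration of the three MapReduce phases.
    # A real framework distributes each phase across many machines.
    from collections import defaultdict

    def map_phase(documents):
        """Emit (key, value) pairs: here, (word, 1) for every word seen."""
        for doc in documents:
            for word in doc.split():
                yield (word.lower(), 1)

    def shuffle_phase(pairs):
        """Group the values by key, as the framework does between map and reduce."""
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(grouped):
        """Combine the values collected for each key into a single result."""
        return {key: sum(values) for key, values in grouped.items()}

    documents = ["the quick brown fox", "the lazy dog", "the fox"]
    print(reduce_phase(shuffle_phase(map_phase(documents))))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}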
There are many open-source frameworks that provide MapReduce functionality, but the most popular is undoubtedly Apache Hadoop. Hadoop handles the complexities of coordinating MapReduce jobs along with providing tools such as a distributed filesystem for storing data across multiple machines.
Hadoop is written primarily in the Java programming language, and many MapReduce applications are built directly against its Java APIs. Although working with the Hadoop Java APIs can be performant, the framework itself can be a bit complex for the uninitiated. Furthermore, some developers may prefer to use a different language, such as Python.
Fortunately, it's possible to build components of our data pipeline in almost any language we want using the Hadoop streaming utility. Hadoop streaming allows developers to use arbitrary applications as the mapper and reducer, provided that they can read from and write to the standard data pipeline streams, better known as standard input (stdin) and standard output (stdout). Even if you are not familiar with the concept of stdin or stdout, if you have ever typed a single command on the command line, you have used them.
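As a quick sketch of how this works in practice, here is a word count written as a pair of Python scripts in the style Hadoop streaming expects; the file names (mapper.py and reducer.py) and the word-count task itself are placeholders chosen for illustration. The mapper reads raw lines from stdin and writes tab-separated key/value pairs to stdout:

    # mapper.py -- a minimal Hadoop streaming mapper sketch.
    # Input records arrive on stdin; each line written to stdout is
    # treated as a tab-separated key/value pair (here: word <TAB> 1).
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")

The matching reducer receives the mapper's output on stdin after the framework has sorted it by key, so all of the lines for a given word arrive together:

    # reducer.py -- a minimal Hadoop streaming reducer sketch.
    # Totals the counts for each word, emitting a result whenever
    # the incoming (sorted) key changes.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Because both scripts speak nothing but stdin and stdout, you can test them locally without Hadoop at all by imitating the shuffle step with the Unix sort command: pipe a text file through the mapper, then through sort, then through the reducer.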
Let's take a look at how to build simple single-step and multiple-step data pipelines using the Hadoop streaming API. Our examples use scripts written in Python to transform data stored in HDFS, the Hadoop distributed filesystem, via a MapReduce job. We will start with unstructured source-text documents, extract structured data, and save the result in new files suitable for loading into an aggregate analysis or visualization tool.
The Simplest Pipeline: stdin to stdout
Before we take a look at how to start building large, distributed pipelines for transforming large amounts of data, let's look back at a familiar standard for pipeline programming: the Unix command line. One of the fundamental control structures found on the Unix command line is the pipe. Unix pipes work with data in much the same way real pipes deal with water and transport video game plumbers: something goes in one end and comes out the other.
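That same read-from-stdin, write-to-stdout pattern is all a Hadoop streaming script needs, so it is worth seeing in its smallest possible form. The tiny filter below, with the purely illustrative name upper.py, uppercases whatever flows through it:

    # upper.py -- the stdin-to-stdout pattern at its simplest:
    # read lines from standard input, transform them, and write the
    # results to standard output for the next command in the pipe.
    import sys

    for line in sys.stdin:
        sys.stdout.write(line.upper())

Dropped into a command-line pipe (for example, echo hello | python upper.py), it prints HELLO, and exactly the same kind of script can serve as a mapper or reducer in a streaming job.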
 