Tasks that require very little work to distribute across a number of machines are
known as embarrassingly parallel problems—probably because this type of problem
provides you with an embarrassment of data riches. Or it might mean that you tell
your boss, “I'm embarrassed to say this, but all I really had to do to earn my paycheck
was to process your 200GB dataset by writing a 20-line Python script.”
The overall concept behind MapReduce is fairly simple, but the logistics of coordinating all the shards, shuffling, and subprocesses can become very complex, especially when data sizes get large. First of all, imagine having to build a system that keeps track of every shard of data being processed and which machine is handling it. Likewise, coordinating separate machines as they pass messages to one another is daunting. The shuffle phase of MapReduce can be fairly tricky as well; when you are dealing with millions of independently processed shards of data, efficiently sorting them all into their respective reducer steps can take quite a long time. Because of these challenges, building your own custom MapReduce framework is rarely practical. Instead of reinventing the wheel, use an available framework.
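To make those phases concrete before bringing Hadoop into the picture, here is a toy, purely in-memory sketch of a word count expressed as separate map, shuffle, and reduce steps. The function names and sample documents are invented for illustration, and a real framework would run each phase across many machines while handling the sorting, message passing, and failure recovery described above.

    # A toy, in-memory illustration of the three MapReduce phases.
    # A real framework distributes each phase across many machines.
    from collections import defaultdict

    def map_phase(documents):
        """Emit (key, value) pairs: here, (word, 1) for every word seen."""
        for doc in documents:
            for word in doc.split():
                yield (word.lower(), 1)

    def shuffle_phase(pairs):
        """Group the values by key, as the framework does between map and reduce."""
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(grouped):
        """Combine the values collected for each key into a single result."""
        return {key: sum(values) for key, values in grouped.items()}

    documents = ["the quick brown fox", "the lazy dog", "the fox"]
    print(reduce_phase(shuffle_phase(map_phase(documents))))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}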
There are many open-source frameworks that provide MapReduce functionality, but the most popular is undoubtedly Apache Hadoop. Hadoop handles the complexities of coordinating MapReduce jobs along with providing tools such as a distributed filesystem for storing data across multiple machines.
Hadoop is written primarily in the Java programming language, and many MapReduce applications are built directly against its Java APIs. Although working with the Hadoop Java APIs can be performant, the framework itself can be a bit complex for the uninitiated. Furthermore, some developers may prefer to use a different language, such as Python.
Fortunately, it's possible to build components of our data pipeline in almost any language we want using the Hadoop streaming utility. Hadoop streaming allows developers to use arbitrary applications as the mapper and reducer, provided that they can read from and write to the standard data pipeline streams, better known as standard input (stdin) and standard output (stdout). Even if you are not familiar with the concept of stdin or stdout, if you have ever typed a single command on the command line, you have used them.
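As a quick sketch of how this works in practice, here is a word count written as a pair of Python scripts in the style Hadoop streaming expects; the file names (mapper.py and reducer.py) and the word-count task itself are placeholders chosen for illustration. The mapper reads raw lines from stdin and writes tab-separated key/value pairs to stdout:

    # mapper.py -- a minimal Hadoop streaming mapper sketch.
    # Input records arrive on stdin; each line written to stdout is
    # treated as a tab-separated key/value pair (here: word <TAB> 1).
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")

The matching reducer receives the mapper's output on stdin after the framework has sorted it by key, so all of the lines for a given word arrive together:

    # reducer.py -- a minimal Hadoop streaming reducer sketch.
    # Totals the counts for each word, emitting a result whenever
    # the incoming (sorted) key changes.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Because both scripts speak nothing but stdin and stdout, you can test them locally without Hadoop at all by imitating the shuffle step with the Unix sort command: pipe a text file through the mapper, then through sort, then through the reducer.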
Let's take a look at how to build simple single-step and multiple-step data pipelines using the Hadoop streaming API. Our examples use scripts written in Python to transform data stored in HDFS, the Hadoop distributed filesystem, via a MapReduce job. We will start with unstructured source-text documents, extract structured data, and save the result in new files suitable for loading into an aggregate analysis or visualization tool.
The Simplest Pipeline: stdin to stdout
Before we take a look at how to start building large, distributed pipelines for transforming large amounts of data, let's look back at a familiar standard for pipeline programming: the Unix command line. One of the fundamental control structures found on the Unix command line is the pipe. Unix pipes work with data in much the same way real pipes deal with water and transport video game plumbers: something goes in one end and comes out the other.
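That same read-from-stdin, write-to-stdout pattern is all a Hadoop streaming script needs, so it is worth seeing in its smallest possible form. The tiny filter below, with the purely illustrative name upper.py, uppercases whatever flows through it:

    # upper.py -- the stdin-to-stdout pattern at its simplest:
    # read lines from standard input, transform them, and write the
    # results to standard output for the next command in the pipe.
    import sys

    for line in sys.stdin:
        sys.stdout.write(line.upper())

Dropped into a command-line pipe (for example, echo hello | python upper.py), it prints HELLO, and exactly the same kind of script can serve as a mapper or reducer in a streaming job.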
 