CHAPTER 14
Data Engineering: MapReduce,
Pregel, and Hadoop
We have two contributors to this chapter, David Crawshaw and Josh
Wills. Rachel worked with both of them at Google on the Google+ data
science team, though the two of them never actually worked together
because Josh Wills left to go to Cloudera and David Crawshaw replaced
him in the role of tech lead. We can call them “data engineers,” although
that term might be as problematic, overloaded, or ambiguous as
“data scientist.” Suffice it to say that they've both
worked as software engineers and dealt with massive amounts of data.
If we look at the data science process from Chapter 2, Josh and David
were responsible at Google for collecting data (frontend and backend
logging), building the massive data pipelines to store and munge the
data, and building up the engineering infrastructure to support
analysis, dashboards, analytics, A/B testing, and more broadly, data
science.
In this chapter we'll hear firsthand from Google engineers about
MapReduce, which was developed at Google and later reimplemented
elsewhere in open source versions. MapReduce is an algorithm and
framework for dealing with massive amounts of data, and it has recently
become popular in industry. The goal of this chapter is to clear up
some of the mysteriousness surrounding MapReduce. It has become such
a buzzword that many data scientist job ads say “must know Hadoop”
(an open source implementation of MapReduce). We suspect these ads
are written by HR departments who don't really understand what
MapReduce is good for, or that not all data science problems
require MapReduce. But as it's
 