Data Engineering: MapReduce, Pregel, and Hadoop - Doing Data Science

CHAPTER 14

Data Engineering: MapReduce, Pregel, and Hadoop

We have two contributors to this chapter, David Crawshaw and Josh Wills. Rachel worked with both of them at Google on the Google+ data science team, though the two never actually worked together: Josh Wills left for Cloudera, and David Crawshaw replaced him as tech lead. We can call them “data engineers,” although that term may be as problematic, overloaded, or ambiguous as “data scientist.” Suffice it to say that they've both worked as software engineers and dealt with massive amounts of data.

If we look at the data science process from Chapter 2, Josh and David were responsible at Google for collecting data (frontend and backend logging), building the massive data pipelines to store and munge the data, and building up the engineering infrastructure to support analysis, dashboards, analytics, A/B testing, and, more broadly, data science.

In this chapter we'll hear firsthand from Google engineers about MapReduce, which was developed at Google and later reimplemented elsewhere as open source. MapReduce is an algorithm and framework for dealing with massive amounts of data that has recently become popular in industry. The goal of this chapter is to clear up some of the mysteriousness surrounding MapReduce. It's become such a buzzword that many data scientist job openings are advertised with the line “must know Hadoop” (an open source implementation of MapReduce). We suspect these ads are written by HR departments who don't really understand what MapReduce is good for, or that not all data science problems require MapReduce. But as it's
