Joins
MapReduce can perform joins between large datasets, but writing the code to do joins from
scratch is fairly involved. Rather than writing MapReduce programs, you might consider
using a higher-level framework such as Pig, Hive, Cascading, Crunch, or Spark, in which join
operations are a core part of the implementation.
Let's briefly consider the problem we are trying to solve. We have two datasets — for
example, the weather stations database and the weather records — and we want to reconcile
the two. Let's say we want to see each station's history, with the station's metadata inlined
in each output row. This is illustrated in Figure 9-2.
How we implement the join depends on how large the datasets are and how they are
partitioned. If one dataset is large (the weather records) but the other one is small enough
to be distributed to each node in the cluster (as the station metadata is), the join can be
effected by a MapReduce job that brings the records for each station together (a partial
sort on station ID, for example). The mapper or reducer uses the smaller dataset to look up
the station metadata for a station ID, so it can be written out with each record. See Side
Data Distribution for a discussion of this approach, where we focus on the mechanics of
distributing the data to nodes in the cluster.
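As a rough sketch of this map-side approach, the mapper below loads the small station metadata file from the distributed cache in its setup() method and inlines the station name into every weather record it emits. The class name, the tab-separated record layout, and the assumption that the metadata fits in a single cached file are illustrative choices for this sketch, not part of the book's example code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper for a map-side join: the small station metadata file
// is shipped to every node via job.addCacheFile() and loaded into memory.
public class StationMetadataJoinMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> stationMetadata = new HashMap<>();

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    // Cache files are localized on each node; we assume a single
    // tab-separated file of "stationId<TAB>stationName" lines, available
    // in the task's working directory under its base name.
    URI[] cacheFiles = context.getCacheFiles();
    if (cacheFiles != null && cacheFiles.length > 0) {
      String localName = new Path(cacheFiles[0].getPath()).getName();
      try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] fields = line.split("\t", 2);
          if (fields.length == 2) {
            stationMetadata.put(fields[0], fields[1]);
          }
        }
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assume each weather record starts with its station ID, tab-separated
    // from the rest of the record.
    String[] fields = value.toString().split("\t", 2);
    if (fields.length < 2) {
      return; // skip malformed records in this sketch
    }
    String stationName = stationMetadata.getOrDefault(fields[0], "UNKNOWN");
    // Write the record out with the station's metadata inlined.
    context.write(new Text(fields[0]),
        new Text(stationName + "\t" + fields[1]));
  }
}

In the driver you would add the (hypothetical) metadata file with a call such as job.addCacheFile(new URI("/metadata/stations.tsv")) before submitting the job. Because the lookup table is held in memory by every task, only the large weather records dataset actually flows through the job.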