Joins
MapReduce can perform joins between large datasets, but writing the code to do joins from
scratch is fairly involved. Rather than writing MapReduce programs, you might consider
using a higher-level framework such as Pig, Hive, Cascading, Crunch, or Spark, in which join
operations are a core part of the implementation.
Let's briefly consider the problem we are trying to solve. We have two datasets — for
example, the weather stations database and the weather records — and we want to reconcile
the two. Let's say we want to see each station's history, with the station's metadata inlined
in each output row. This is illustrated in Figure 9-2.
How we implement the join depends on how large the datasets are and how they are
partitioned. If one dataset is large (the weather records) but the other one is small enough
to be distributed to each node in the cluster (as the station metadata is), the join can be
effected by a MapReduce job that brings the records for each station together (a partial
sort on station ID, for example). The mapper or reducer uses the smaller dataset to look up
the station metadata for a station ID, so it can be written out with each record. See Side
Data Distribution for a discussion of this approach, where we focus on the mechanics of
distributing the data to nodes in the cluster.
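As a rough sketch of this map-side approach, the mapper below loads the small station metadata file from the distributed cache in its setup() method and inlines the station name into every weather record it emits. The class name, the tab-separated record layout, and the assumption that the metadata fits in a single cached file are illustrative choices for this sketch, not part of the book's example code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper for a map-side join: the small station metadata file
// is shipped to every node via job.addCacheFile() and loaded into memory.
public class StationMetadataJoinMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> stationMetadata = new HashMap<>();

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    // Cache files are localized on each node; we assume a single
    // tab-separated file of "stationId<TAB>stationName" lines, available
    // in the task's working directory under its base name.
    URI[] cacheFiles = context.getCacheFiles();
    if (cacheFiles != null && cacheFiles.length > 0) {
      String localName = new Path(cacheFiles[0].getPath()).getName();
      try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] fields = line.split("\t", 2);
          if (fields.length == 2) {
            stationMetadata.put(fields[0], fields[1]);
          }
        }
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assume each weather record starts with its station ID, tab-separated
    // from the rest of the record.
    String[] fields = value.toString().split("\t", 2);
    if (fields.length < 2) {
      return; // skip malformed records in this sketch
    }
    String stationName = stationMetadata.getOrDefault(fields[0], "UNKNOWN");
    // Write the record out with the station's metadata inlined.
    context.write(new Text(fields[0]),
        new Text(stationName + "\t" + fields[1]));
  }
}

In the driver you would add the (hypothetical) metadata file with a call such as job.addCacheFile(new URI("/metadata/stations.tsv")) before submitting the job. Because the lookup table is held in memory by every task, only the large weather records dataset actually flows through the job.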