Map-Side Joins
A map-side join between large inputs works by performing the join before the data reaches the map function. For this to work, though, the inputs to each map must be partitioned and sorted in a particular way. Each input dataset must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source. All the records for a particular key must reside in the same partition. This may sound like a strict requirement (and it is), but it actually fits the description of the output of a MapReduce job.
A map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are not splittable (by virtue of being smaller than an HDFS block, or being gzip compressed, for example). In the context of the weather example, if we ran a partial sort on the stations file by station ID, and another identical sort on the records, again by station ID and with the same number of reducers, then the two outputs would satisfy the conditions for running a map-side join.
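To make the mechanics concrete, here is a minimal simulation in Python (not Hadoop code): given one pair of corresponding partitions, each already sorted by the join key, the join is a simple merge performed before any per-record "map" logic runs. The station IDs and record shapes below are invented for illustration.

```python
def merge_join(left, right):
    """Inner join of two lists sorted by key; yields (key, left_val, right_val).

    In a real map-side join, each map task would run logic like this over one
    pair of matching partitions from the two pre-sorted datasets.
    """
    i = j = 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # emit every right-side record sharing this key with the left record
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                yield lk, lv, right[j2][1]
                j2 += 1
            i += 1

# One partition of each dataset, already sorted by station ID (hypothetical data)
stations = [("011990", "SIHCCAJAVRI"), ("012650", "TYNSET")]
records  = [("011990", 0), ("011990", 22), ("012650", -11)]

joined = list(merge_join(stations, records))
```

Because both inputs are sorted and co-partitioned, the merge is a single linear pass with no shuffle, which is exactly where the efficiency of the map-side join comes from.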
You use a CompositeInputFormat from the org.apache.hadoop.mapreduce.join package to run a map-side join. The input sources and join type (inner or outer) for CompositeInputFormat are configured through a join expression that is written according to a simple grammar. The package documentation has details and examples.
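As a rough illustration of the grammar (the paths and input format class here are placeholders, not taken from the text), an inner join of two sorted job outputs might be expressed along these lines:

```
inner(tbl(org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat, "/sorted/stations"),
      tbl(org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat, "/sorted/records"))
```

Each tbl(...) names an input format and a path, and the outer operation (inner or outer) specifies the join type; in practice such expressions are usually built with CompositeInputFormat's compose() helper rather than written by hand.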
The org.apache.hadoop.examples.Join example is a general-purpose command-line program for running a map-side join, since it allows you to run a MapReduce job for any specified mapper and reducer over multiple inputs that are joined with a given join operation.
Reduce-Side Joins
A reduce-side join is more general than a map-side join, in that the input datasets don't have to be structured in any particular way, but it is less efficient because both datasets have to go through the MapReduce shuffle. The basic idea is that the mapper tags each record with its source and uses the join key as the map output key, so that the records with the same key are brought together in the reducer. We use several ingredients to make this work in practice:
Multiple inputs
The input sources for the datasets generally have different formats, so it is very convenient to use the MultipleInputs class (see Multiple Inputs) to separate the logic for parsing and tagging each source.
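The tag-and-shuffle idea behind a reduce-side join can be simulated outside Hadoop. In this minimal Python sketch (the datasets, record formats, and mapper names are invented), two "mappers" tag records with their source and emit the join key, a "shuffle" groups values by key, and a "reducer" produces the joined output:

```python
from collections import defaultdict

def station_mapper(line):
    station_id, name = line.split("\t")
    return station_id, ("station", name)       # tag record with its source

def record_mapper(line):
    station_id, temp = line.split("\t")
    return station_id, ("record", int(temp))   # tag record with its source

def shuffle(pairs):
    """Group map outputs by key, as the MapReduce shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def join_reducer(key, values):
    """Inner join: pair every station name with every temperature for the key."""
    names = [v for tag, v in values if tag == "station"]
    temps = [v for tag, v in values if tag == "record"]
    for name in names:
        for temp in temps:
            yield key, name, temp

stations = ["011990\tSIHCCAJAVRI", "012650\tTYNSET"]
records  = ["011990\t0", "011990\t22", "012650\t-11"]

mapped = [station_mapper(l) for l in stations] + [record_mapper(l) for l in records]
joined = [out for key, values in sorted(shuffle(mapped).items())
              for out in join_reducer(key, values)]
```

Note that both datasets pass through the shuffle here, which is the cost the text contrasts with the map-side approach; in real Hadoop code the two mapper functions would be wired up with MultipleInputs rather than called directly.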