Secondary sort
As described, the reducer will see the records from both sources that have the same
key, but they are not guaranteed to be in any particular order. However, to perform the
join, it is important to have the data from one source before that from the other. For the
weather data join, the station record must be the first of the values seen for each key, so
the reducer can fill in the weather records with the station name and emit them
straightaway. Of course, it would be possible to receive the records in any order if we
buffered them in memory, but this should be avoided because the number of records in
any group may be very large and exceed the amount of memory available to the reducer.
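The point about ordering can be illustrated with a minimal, Hadoop-free sketch. It assumes the values for one key arrive with the station name first, followed by the weather records. The class name StreamingJoinSketch, the method joinGroup, and the sample values are hypothetical and are not part of the original example code.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Consumer;

// Hypothetical illustration only; this is not the reducer used for the weather data join.
final class StreamingJoinSketch {

  // Joins the values of a single key group. Assumes the first value is the
  // station name and every later value is a weather record. Only the station
  // name is kept in memory; each weather record is emitted as soon as it is read.
  static void joinGroup(String stationId, Iterator<String> values, Consumer<String> emit) {
    if (!values.hasNext()) {
      return; // empty group, nothing to join
    }
    String stationName = values.next(); // relies on the station-first ordering
    while (values.hasNext()) {
      String weatherRecord = values.next();
      emit.accept(stationId + "\t" + stationName + "\t" + weatherRecord);
    }
  }

  public static void main(String[] args) {
    // Made-up sample values for one key group.
    Iterator<String> group = Arrays.asList("Station One", "rec-a", "rec-b").iterator();
    joinGroup("S1", group, System.out::println);
  }
}
```

If the station name could arrive at any position, the loop would have to hold the weather records until the name appeared, which is the buffering the text warns against.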
We saw in Secondary Sort how to impose an order on the values for each key that the reducers see, so we use this technique here.
To tag each record, we use TextPair for the keys (to store the station ID) and the tag. The only requirement for the tag values is that they sort in such a way that the station records come before the weather records. This can be achieved by tagging station records as 0 and weather records as 1. The mapper classes to do this are shown in Examples 9-9 and 9-10.
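The effect of the tag on sort order can be checked with a small stand-alone sketch. TaggedKey below is a hypothetical stand-in for the composite key, not the TextPair class itself, and the station IDs are made up. It assumes the key compares by station ID first and by tag second.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a (station ID, tag) composite key.
final class TaggedKey implements Comparable<TaggedKey> {
  final String stationId;
  final String tag; // "0" marks a station record, "1" a weather record

  TaggedKey(String stationId, String tag) {
    this.stationId = stationId;
    this.tag = tag;
  }

  // Orders by station ID, then by tag, so "0" sorts before "1" within a station.
  @Override
  public int compareTo(TaggedKey other) {
    int byStation = stationId.compareTo(other.stationId);
    return byStation != 0 ? byStation : tag.compareTo(other.tag);
  }

  @Override
  public String toString() {
    return stationId + "/" + tag;
  }

  // Returns the keys in sorted order, rendered as "stationId/tag" labels.
  static List<String> sortedLabels(List<TaggedKey> keys) {
    List<TaggedKey> copy = new ArrayList<>(keys);
    Collections.sort(copy);
    List<String> labels = new ArrayList<>();
    for (TaggedKey key : copy) {
      labels.add(key.toString());
    }
    return labels;
  }

  public static void main(String[] args) {
    List<TaggedKey> keys = Arrays.asList(
        new TaggedKey("S2", "1"), new TaggedKey("S1", "1"),
        new TaggedKey("S2", "0"), new TaggedKey("S1", "0"));
    System.out.println(sortedLabels(keys)); // prints [S1/0, S1/1, S2/0, S2/1]
  }
}
```

Only the sort order is shown here. Any pair of tag values would do as long as the station tag compares lower than the weather tag.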
Example 9-9. Mapper for tagging station records for a reduce-side join
public class JoinStationMapper
    extends Mapper<LongWritable, Text, TextPair, Text> {
  private NcdcStationMetadataParser parser = new NcdcStationMetadataParser();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (parser.parse(value)) {
      context.write(new TextPair(parser.getStationId(), "0"),
          new Text(parser.getStationName()));
    }
  }
}
Example 9-10. Mapper for tagging weather records for a reduce-side join
public class JoinRecordMapper
    extends Mapper<LongWritable, Text, TextPair, Text> {
  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  protected void map(LongWritable key, Text value, Context context)