to a reducer that may be at a completely different node. Formally this is considered to be passed by value, as a copy of the key/value pair is sent over. In the current case where we can chain one Mapper to another, we can execute the two in the same JVM thread. Therefore, it's possible for the key/value pairs to be passed by reference, where the output of the initial Mapper stays in place in memory and the following Mapper refers to it directly in the same memory location. When Map1 calls OutputCollector.collect(K k, V v), the objects k and v pass directly to Map2's map() method. This improves performance by not having to clone a potentially large volume of data between the mappers. But doing this can violate one of the more subtle "contracts" in Hadoop's MapReduce API. The call to OutputCollector.collect(K k, V v) is guaranteed not to alter the content of k and v. Map1 can call OutputCollector.collect(K k, V v) and then use the objects k and v afterward, fully expecting their values to stay the same. But if we pass those objects by reference to Map2, then Map2 may alter them and violate the API's guarantee.

If you're sure that Map1's map() method doesn't use the content of k and v after calling OutputCollector.collect(K k, V v), or that Map2 doesn't change the value of its k and v input, you can achieve some performance gains by setting byValue to false. If you're not sure of the Mapper's internal code, it's best to play safe and let byValue be true, maintaining the pass-by-value model, and be certain that the Mappers will work as expected.
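The hazard that byValue guards against can be illustrated outside of Hadoop with plain Java. The sketch below is a simplified stand-in, not Hadoop's actual ChainMapper code: map2 plays the role of the second chained Mapper, and StringBuilder stands in for a mutable Writable value. When the object is handed over by reference, the downstream "mapper" can silently change what the upstream one still holds; cloning first preserves the pass-by-value contract.

```java
// Hypothetical stand-ins for chained mappers; names and types are
// illustrative only, not Hadoop's real API.
public class ByReferenceDemo {

    // Plays the role of Map2: mutates the value it receives,
    // which a real Mapper implementation is free to do.
    static void map2(StringBuilder v) {
        v.append("-changed");
    }

    public static void main(String[] args) {
        // Pass by reference: Map2 gets the very object "Map1" emitted...
        StringBuilder v = new StringBuilder("original");
        map2(v);
        // ...so Map1's copy has been silently altered behind its back.
        System.out.println(v);   // prints "original-changed"

        // Pass by value: Map2 gets a copy, so Map1's object is untouched.
        StringBuilder w = new StringBuilder("original");
        map2(new StringBuilder(w.toString()));
        System.out.println(w);   // prints "original"
    }
}
```

Cloning costs an extra copy of every record, which is exactly the overhead byValue=false avoids when you know the mappers are well behaved.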
5.2 Joining data from different sources
It's inevitable that you'll come across data analyses where you need to pull in data from different sources. For example, given our patent data sets, you may want to find out if certain countries cite patents from another country. You'll have to look at citation data (cite75_99.txt) as well as patent data for country information (apat63_99.txt). In the database world it would just be a matter of joining two tables, and most databases automagically take care of the join processing for you. Unfortunately, joining data in Hadoop is more involved, and there are several possible approaches with different trade-offs.
We'll use a couple of toy data sets to better illustrate joining in Hadoop. Let's take a comma-separated Customers file where each record has three fields: Customer ID, Name, and Phone Number. We put four records in the file for illustration:
1,Stephanie Leung,555-555-5555
2,Edward Kim,123-456-7890
3,Jose Madriz,281-330-8004
4,David Stork,408-555-0000
We store Customer orders in a separate file, called Orders. It's in CSV format, with four
fields: Customer ID, Order ID, Price, and Purchase Date.
3,A,12.95,02-Jun-2008
1,B,88.25,20-May-2008