to a reducer that may be at a completely different node. Formally this is considered to be passed by value, as a copy of the key/value pair is sent over. In the current case, where we can chain one Mapper to another, we can execute the two in the same JVM thread. Therefore, it's possible for the key/value pairs to be passed by reference, where the output of the initial Mapper stays in place in memory and the following Mapper refers to it directly in the same memory location. When Map1 calls OutputCollector.collect(K k, V v), the objects k and v pass directly to Map2's map() method. This improves performance by not having to clone a potentially large volume of data between the mappers. But doing this can violate one of the more subtle "contracts" in Hadoop's MapReduce API. The call to OutputCollector.collect(K k, V v) is guaranteed not to alter the content of k and v. Map1 can call OutputCollector.collect(K k, V v) and then use the objects k and v afterward, fully expecting their values to stay the same. But if we pass those objects by reference to Map2, then Map2 may alter them and violate the API's guarantee. If you're sure that Map1's map() method doesn't use the content of k and v after calling OutputCollector.collect(K k, V v), or that Map2 doesn't change the value of its k and v input, you can achieve some performance gains by setting byValue to false. If you're not sure of the Mapper's internal code, it's best to play safe, let byValue be true to maintain the pass-by-value model, and be certain that the Mappers will work as expected.
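The trade-off is easy to see in plain Java. The sketch below is not Hadoop's actual ChainMapper code (the class and method names here are hypothetical stand-ins); it simulates the two modes: with byValue false the downstream mapper mutates the very object the upstream mapper still holds, and with byValue true a clone absorbs the mutation.

```java
// A minimal sketch (hypothetical names, not Hadoop internals) of the byValue trade-off.
public class ByValueDemo {
    // Stand-in for a mutable Writable value.
    static class MutableText {
        final StringBuilder s;
        MutableText(String v) { s = new StringBuilder(v); }
        MutableText copyOf() { return new MutableText(s.toString()); }
        @Override public String toString() { return s.toString(); }
    }

    // Simulates Map1 calling collect(): hand the value to the next mapper,
    // cloning it first when byValue is true. Returns what Map1 sees afterward.
    public static String run(boolean byValue) {
        MutableText v = new MutableText("original");
        MutableText passed = byValue ? v.copyOf() : v;
        map2(passed);            // the chained mapper processes the pair
        return v.toString();     // has Map1's object been changed behind its back?
    }

    // Map2 mutates its input in place, as a real Mapper legally may.
    static void map2(MutableText v) { v.s.append("-mutated"); }

    public static void main(String[] args) {
        System.out.println("byValue=false, Map1 now sees: " + run(false)); // original-mutated
        System.out.println("byValue=true,  Map1 now sees: " + run(true));  // original
    }
}
```

In the real API, byValue is the boolean argument you pass per mapper when registering it with ChainMapper.addMapper(); the clone in the byValue=true branch is the cost you pay for upholding the collect() contract.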
5.2 Joining data from different sources
It's inevitable that you'll come across data analyses where you need to pull in data from
different sources. For example, given our patent data sets, you may want to find out
if certain countries cite patents from another country. You'll have to look at citation data (cite75_99.txt) as well as patent data for country information (apat63_99.txt). In the database world it would just be a matter of joining two tables, and most databases automagically take care of the join processing for you. Unfortunately, joining data in Hadoop is more involved, and there are several possible approaches with different trade-offs.
We use a couple of toy data sets to better illustrate joining in Hadoop. Let's take a comma-separated Customers file where each record has three fields: Customer ID, Name, and Phone Number. We put four records in the file for illustration:
1,Stephanie Leung,555-555-5555
2,Edward Kim,123-456-7890
3,Jose Madriz,281-330-8004
4,David Stork,408-555-0000
We store Customer orders in a separate file, called Orders. It's in CSV format, with four
fields: Customer ID, Order ID, Price, and Purchase Date.
3,A,12.95,02-Jun-2008
1,B,88.25,20-May-2008
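Before looking at how Hadoop distributes this work, it helps to pin down what the join should produce. The sketch below (plain in-memory Java, not a MapReduce job; the class and method names are my own) performs the inner join on Customer ID that the later MapReduce approaches will reproduce at scale:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal in-memory sketch of the desired result: match each Orders record
// to its Customers record on the Customer ID (the first CSV field of both files).
public class CsvJoinDemo {
    public static List<String> innerJoin(List<String> customers, List<String> orders) {
        // Index customers by Customer ID.
        Map<String, String> byId = new HashMap<>();
        for (String c : customers) {
            String[] f = c.split(",", 2);
            byId.put(f[0], f[1]);      // e.g. "1" -> "Stephanie Leung,555-555-5555"
        }
        // For each order, emit the joined record when the customer exists.
        List<String> joined = new ArrayList<>();
        for (String o : orders) {
            String[] f = o.split(",", 2);
            String cust = byId.get(f[0]);
            if (cust != null) joined.add(f[0] + "," + cust + "," + f[1]);
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> customers = List.of(
            "1,Stephanie Leung,555-555-5555",
            "2,Edward Kim,123-456-7890",
            "3,Jose Madriz,281-330-8004",
            "4,David Stork,408-555-0000");
        List<String> orders = List.of(
            "3,A,12.95,02-Jun-2008",
            "1,B,88.25,20-May-2008");
        innerJoin(customers, orders).forEach(System.out::println);
        // 3,Jose Madriz,281-330-8004,A,12.95,02-Jun-2008
        // 1,Stephanie Leung,555-555-5555,B,88.25,20-May-2008
    }
}
```

This works only because both toy files fit in one JVM's memory; the point of the rest of the section is what to do when neither side does.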
 