to a reducer that may be at a completely different node. Formally this is considered to be passed by value, as a copy of the key/value pair is sent over. In the current case where we can chain one Mapper to another, we can execute the two in the same JVM thread. Therefore, it's possible for the key/value pairs to be passed by reference, where the output of the initial Mapper stays in place in memory and the following Mapper refers to it directly in the same memory location. When Map1 calls OutputCollector.collect(K k, V v), the objects k and v pass directly to Map2's map() method. This improves performance by not having to clone a potentially large volume of data between the mappers. But doing this can violate one of the more subtle "contracts" in Hadoop's MapReduce API. The call to OutputCollector.collect(K k, V v) is guaranteed not to alter the content of k and v. Map1 can call OutputCollector.collect(K k, V v) and then use the objects k and v afterward, fully expecting their values to stay the same. But if we pass those objects by reference to Map2, then Map2 may alter them and violate the API's guarantee.

If you're sure that Map1's map() method doesn't use the content of k and v after calling OutputCollector.collect(K k, V v), or that Map2 doesn't change the value of its k and v input, you can achieve some performance gains by setting byValue to false. If you're not sure of the Mapper's internal code, it's best to play safe and let byValue be true, maintaining the pass-by-value model, and be certain that the Mappers will work as expected.
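The hazard that byValue guards against can be illustrated outside of Hadoop with plain Java. The sketch below is a simplified stand-in, not Hadoop's actual ChainMapper code: map2 plays the role of the second chained Mapper, and StringBuilder stands in for a mutable Writable value. When the object is handed over by reference, the downstream "mapper" can silently change what the upstream one still holds; cloning first preserves the pass-by-value contract.

```java
// Hypothetical stand-ins for chained mappers; names and types are
// illustrative only, not Hadoop's real API.
public class ByReferenceDemo {

    // Plays the role of Map2: mutates the value it receives,
    // which a real Mapper implementation is free to do.
    static void map2(StringBuilder v) {
        v.append("-changed");
    }

    public static void main(String[] args) {
        // Pass by reference: Map2 gets the very object "Map1" emitted...
        StringBuilder v = new StringBuilder("original");
        map2(v);
        // ...so Map1's copy has been silently altered behind its back.
        System.out.println(v);   // prints "original-changed"

        // Pass by value: Map2 gets a copy, so Map1's object is untouched.
        StringBuilder w = new StringBuilder("original");
        map2(new StringBuilder(w.toString()));
        System.out.println(w);   // prints "original"
    }
}
```

Cloning costs an extra copy of every record, which is exactly the overhead byValue=false avoids when you know the mappers are well behaved.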
5.2 Joining data from different sources
It's inevitable that you'll come across data analyses where you need to pull in data from different sources. For example, given our patent data sets, you may want to find out if certain countries cite patents from another country. You'll have to look at citation data (cite75_99.txt) as well as patent data for country information (apat63_99.txt). In the database world it would just be a matter of joining two tables, and most databases automagically take care of the join processing for you. Unfortunately, joining data in Hadoop is more involved, and there are several possible approaches with different trade-offs.
We'll use a couple of toy data sets to better illustrate joining in Hadoop. Let's take a comma-separated Customers file where each record has three fields: Customer ID, Name, and Phone Number. We put four records in the file for illustration:
1,Stephanie Leung,555-555-5555
2,Edward Kim,123-456-7890
3,Jose Madriz,281-330-8004
4,David Stork,408-555-0000
We store Customer orders in a separate file, called Orders. It's in CSV format, with four
fields: Customer ID, Order ID, Price, and Purchase Date.
3,A,12.95,02-Jun-2008
1,B,88.25,20-May-2008