Cookbook - Hadoop in Action


After executing the preceding program, we can see that the output directory now has a separate directory for each country.

ls output/
AD BN CS GE IN LC MT PH SV VE
AE BO CU GF IQ LI MU PK SY VG
AG BR CY GH IR LK MW PL SZ VN
AI BS CZ GL IS LR MX PT TC VU
AL BY DE GN IT LT MY PY TD YE
AM BZ DK GP JM LU NC RO TH YU
AN CA DO GR JO LV NF RU TN ZA
AR CC DZ GT JP LY NG SA TR ZM
AT CD EC GY KE MA NI SD TT ZW
AU CH EE HK KG MC NL SE TW
AW CI EG HN KN MG NO SG TZ
AZ CK ES HR KP MH NZ SI UA
BB CL ET HT KR ML OM SK UG
BE CM FI HU KW MM PA SM US
BG CN FO ID KY MO PE SN UY
BH CO FR IE KZ MQ PF SR UZ
BM CR GB IL LB MR PG SU VC

Within each country's directory are files containing only the records (patents) from that country.

ls output/AD
part-00003 part-00005 part-00006

head output/AD/part-00006
5765303,1998,14046,1996,"AD","",,1,12,42,5,59,11,1,0.4545,0,0,1,67.3636,,,,
5785566,1998,14088,1996,"AD","",,1,9,441,6,69,3,0,1,,0.6667,,4.3333,,,,
5894770,1999,14354,1997,"AD","",,1,,82,5,51,4,0,1,,0.625,,7.5,,,,

We've written this simple partitioning exercise as a map-only program, but you can apply the same technique to the output of reducers as well. Be careful not to confuse it with the partitioner in the MapReduce framework. That partitioner looks at the keys of intermediate records and decides which reducer will process them. The partitioning we're doing here looks at the key/value pair of each output record and decides which file to store it in.
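To make the distinction concrete, here is a minimal plain-Java sketch that does not depend on Hadoop. The class and method names (OutputRouting, reducerFor, outputPathFor) are hypothetical, and the position of the country field (the fifth column) is an assumption read off the sample records above. reducerFor mirrors the arithmetic of Hadoop's default hash partitioner, which sees only the intermediate key; outputPathFor stands in for the decision a MultipleOutputFormat subclass would make in generateFileNameForKeyValue, which sees the output record itself.

```java
class OutputRouting {
    // Framework-style partitioning: choose a reducer from the intermediate key alone.
    // Same arithmetic as Hadoop's default hash partitioner.
    static int reducerFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Output-side partitioning: choose a file path from the record's content.
    // Assumes the country code is the fifth comma-separated field, wrapped in quotes.
    static String outputPathFor(String csvLine, String leafName) {
        String[] fields = csvLine.split(",", -1);
        String country = fields[4].replace("\"", "");
        return country + "/" + leafName;
    }

    public static void main(String[] args) {
        String record =
            "5765303,1998,14046,1996,\"AD\",\"\",,1,12,42,5,59,11,1,0.4545,0,0,1,67.3636,,,,";
        // Where the record is stored depends on its content.
        System.out.println(outputPathFor(record, "part-00006"));
        // Which reducer would handle a key depends only on the key and the reducer count.
        System.out.println(reducerFor("5765303", 4));
    }
}
```

The two functions answer different questions at different stages of a job, which is why one cannot substitute for the other.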

MultipleOutputFormat is simple, but it's also limited. For example, we were able to split the input data by row, but what if we want to split by column? Let's say we want to create two data sets from the patent metadata: one containing time-related information (e.g., publication date) for each patent and another containing geographical information (e.g., country of invention). These two data sets may have different output formats and different data types for their keys and values. For more powerful capabilities, we can look to MultipleOutputs, introduced in version 0.19 of Hadoop.
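As a rough illustration of what splitting by column means, the following plain-Java sketch projects each patent record into two smaller records. It does not use the Hadoop API. The class name ColumnSplit and the output names "chrono" and "geo" are hypothetical, and the choice of columns (the first three fields for time-related data; the patent number plus the fifth and sixth fields for geographical data) is an assumption based on the sample records shown earlier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ColumnSplit {
    // Projects every record into two named outputs, each holding a subset of columns.
    static Map<String, List<String>> split(List<String> records) {
        Map<String, List<String>> outputs = new LinkedHashMap<>();
        outputs.put("chrono", new ArrayList<>());
        outputs.put("geo", new ArrayList<>());
        for (String record : records) {
            // A limit of -1 keeps trailing empty fields, so the indices stay stable.
            String[] f = record.split(",", -1);
            // Assumed layout: patent number, year, date (time-related columns).
            outputs.get("chrono").add(f[0] + "," + f[1] + "," + f[2]);
            // Assumed layout: patent number, country, state (geographical columns).
            outputs.get("geo").add(f[0] + "," + f[4] + "," + f[5]);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        records.add("5785566,1998,14088,1996,\"AD\",\"\",,1,9,441,6,69,3,0,1,,0.6667,,4.3333,,,,");
        Map<String, List<String>> outputs = split(records);
        System.out.println(outputs.get("chrono"));
        System.out.println(outputs.get("geo"));
    }
}
```

Every input record contributes one row to each output, in contrast to the row split, where each record lands in exactly one file.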

The approach taken by MultipleOutputs is different from MultipleOutputFormat. Rather than asking for the filename to output each record, MultipleOutputs creates multiple OutputCollectors. Each OutputCollector can have its own OutputFormat and types for the key/value pair. Your MapReduce program will decide what to output to each OutputCollector. Listing 7.2 shows a program that takes our patent metadata
