After executing the preceding program, we can see that the output directory now has
a separate directory for each country.
ls output/
AD BN CS GE IN LC MT PH SV VE
AE BO CU GF IQ LI MU PK SY VG
AG BR CY GH IR LK MW PL SZ VN
AI BS CZ GL IS LR MX PT TC VU
AL BY DE GN IT LT MY PY TD YE
AM BZ DK GP JM LU NC RO TH YU
AN CA DO GR JO LV NF RU TN ZA
AR CC DZ GT JP LY NG SA TR ZM
AT CD EC GY KE MA NI SD TT ZW
AU CH EE HK KG MC NL SE TW
AW CI EG HN KN MG NO SG TZ
AZ CK ES HR KP MH NZ SI UA
BB CL ET HT KR ML OM SK UG
BE CM FI HU KW MM PA SM US
BG CN FO ID KY MO PE SN UY
BH CO FR IE KZ MQ PF SR UZ
BM CR GB IL LB MR PG SU VC
Within each country's directory are files containing only the records (patents)
from that country.
ls output/AD
part-00003 part-00005 part-00006
head output/AD/part-00006
5765303,1998,14046,1996,"AD","",,1,12,42,5,59,11,1,0.4545,0,0,1,67.3636,,,,
5785566,1998,14088,1996,"AD","",,1,9,441,6,69,3,0,1,,0.6667,,4.3333,,,,
5894770,1999,14354,1997,"AD","",,1,,82,5,51,4,0,1,,0.625,,7.5,,,,
We've written this simple partitioning exercise as a map-only program. You can apply
the same technique to the output of reducers as well. Be careful not to confuse this
with the partitioner in the MapReduce framework. That partitioner looks at the keys
of intermediate records and decides which reducer will process them. The partitioning
we're doing here looks at the key/value pair of each output record and decides which
file to store it in.
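The distinction can be sketched in plain Java with no Hadoop dependency. This is only an illustration, not the Hadoop API: the class and method names are hypothetical, and the sketch assumes the country code is the fifth comma-separated field, as in the sample rows above. One method picks an output name from the record itself, while the other picks a reducer index from the key alone, using the common hash-modulo scheme.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OutputPartitionSketch {

    // Output-side partitioning: choose a file (here, a directory name)
    // by inspecting the record. Assumption: field 5 holds the quoted country code.
    static String outputNameFor(String record) {
        String[] fields = record.split(",", -1);
        return fields[4].replace("\"", "");
    }

    // Framework-style partitioning: choose a reducer index from the key only.
    // The mask keeps the hash non-negative before taking the remainder.
    static int reducerFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Groups records by output name, standing in for "one directory per country".
    static Map<String, List<String>> partitionByCountry(List<String> records) {
        Map<String, List<String>> outputs = new TreeMap<>();
        for (String record : records) {
            outputs.computeIfAbsent(outputNameFor(record), k -> new ArrayList<>())
                   .add(record);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "5765303,1998,14046,1996,\"AD\",\"\",,1,12,42,5,59,11,1,0.4545,0,0,1,67.3636,,,,",
            "5785566,1998,14088,1996,\"AD\",\"\",,1,9,441,6,69,3,0,1,,0.6667,,4.3333,,,,");
        partitionByCountry(records).forEach((name, rows) ->
            System.out.println("output/" + name + " receives " + rows.size() + " record(s)"));
        System.out.println("reducer index for key AD: " + reducerFor("AD", 4));
    }
}
```

The two decisions are independent: the reducer index determines where a record is processed, and the output name determines where the result is stored.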
MultipleOutputFormat is simple, but it's also limited. For example, we were able to
split the input data by row, but what if we want to split by column? Let's say we want to create
two data sets from the patent metadata: one containing time-related information (e.g.,
publication date) for each patent and another one containing geographical information
(e.g., country of invention). These two data sets may need different output formats
and different data types for their keys and values. We can look to MultipleOutputs,
introduced in version 0.19 of Hadoop, for more powerful capabilities.
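To make the column split concrete, here is a minimal sketch in plain Java that does not use any Hadoop class. The names ColumnSplitSketch, SplitResult, and splitByColumn are hypothetical, and the sketch assumes that the first field is the patent number, the second is a year, and the fifth is the country code, as the sample rows suggest. Each input record contributes one entry to each of two outputs whose value types differ.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnSplitSketch {

    // Hypothetical holder for the two data sets; note the different value types.
    static final class SplitResult {
        final Map<String, Integer> chrono = new LinkedHashMap<>(); // patent -> year
        final Map<String, String> geo = new LinkedHashMap<>();     // patent -> country
    }

    // Projects each record onto two column subsets.
    // Assumption: field 1 = patent number, field 2 = year, field 5 = country.
    static SplitResult splitByColumn(List<String> records) {
        SplitResult result = new SplitResult();
        for (String record : records) {
            String[] fields = record.split(",", -1);
            String patent = fields[0];
            result.chrono.put(patent, Integer.parseInt(fields[1]));
            result.geo.put(patent, fields[4].replace("\"", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "5765303,1998,14046,1996,\"AD\",\"\",,1,12,42,5,59,11,1,0.4545,0,0,1,67.3636,,,,",
            "5894770,1999,14354,1997,\"AD\",\"\",,1,,82,5,51,4,0,1,,0.625,,7.5,,,,");
        SplitResult result = splitByColumn(records);
        System.out.println("chrono: " + result.chrono);
        System.out.println("geo:    " + result.geo);
    }
}
```

A filename-per-record scheme cannot express this, because a single input record has to reach two destinations at once, each with its own types.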
The approach taken by MultipleOutputs is different from that of MultipleOutputFormat.
Rather than asking for the filename to which each record should be written,
MultipleOutputs creates multiple OutputCollectors. Each OutputCollector can have its
own OutputFormat and its own types for the key/value pair. Your MapReduce program
decides what to output to each OutputCollector. Listing 7.2 shows a program that takes our patent metadata
 