After executing the preceding program, we can see that the output directory now has
a separate directory for each country.
ls output/
AD BN CS GE IN LC MT PH SV VE
AE BO CU GF IQ LI MU PK SY VG
AG BR CY GH IR LK MW PL SZ VN
AI BS CZ GL IS LR MX PT TC VU
AL BY DE GN IT LT MY PY TD YE
AM BZ DK GP JM LU NC RO TH YU
AN CA DO GR JO LV NF RU TN ZA
AR CC DZ GT JP LY NG SA TR ZM
AT CD EC GY KE MA NI SD TT ZW
AU CH EE HK KG MC NL SE TW
AW CI EG HN KN MG NO SG TZ
AZ CK ES HR KP MH NZ SI UA
BB CL ET HT KR ML OM SK UG
BE CM FI HU KW MM PA SM US
BG CN FO ID KY MO PE SN UY
BH CO FR IE KZ MQ PF SR UZ
BM CR GB IL LB MR PG SU VC
Within each country's directory are files containing only the records (patents) from
that country.
ls output/AD
part-00003 part-00005 part-00006
head output/AD/part-00006
5765303,1998,14046,1996,"AD","",,1,12,42,5,59,11,1,0.4545,0,0,1,67.3636,,,,
5785566,1998,14088,1996,"AD","",,1,9,441,6,69,3,0,1,,0.6667,,4.3333,,,,
5894770,1999,14354,1997,"AD","",,1,,82,5,51,4,0,1,,0.625,,7.5,,,,
We've written this simple partitioning exercise as a map-only program. You can apply
the same technique to the output of reducers as well. Be careful not to confuse this
with the partitioner in the MapReduce framework. That partitioner looks at the keys
of intermediate records and decides which reducer will process them. The partitioning
we're doing here looks at the key/value pair of the output and decides which file to
store it in.
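To make the distinction concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class and method names (PartitionSketch, reducerFor, outputPathFor) are hypothetical and exist only for illustration. The first method mimics a hash-modulo choice of reducer for an intermediate key. The second picks an output path from an output record, assuming the country code is the fifth comma-separated field, as in the sample records above.

```java
class PartitionSketch {
    // Framework-style partitioning: map an intermediate key to a reducer index.
    // Masking with Integer.MAX_VALUE keeps the hash non-negative before the modulo.
    static int reducerFor(String intermediateKey, int numReducers) {
        return (intermediateKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Output-side partitioning: choose the file path for an output record.
    // Assumption: the country code is field index 4 and is wrapped in double quotes.
    static String outputPathFor(String record, String leafName) {
        String[] fields = record.split(",", -1); // -1 keeps trailing empty fields
        String country = fields[4].replace("\"", "");
        return country + "/" + leafName;
    }

    public static void main(String[] args) {
        String record = "5765303,1998,14046,1996,\"AD\",\"\",,1,12,42,5,59,11,1,"
                + "0.4545,0,0,1,67.3636,,,,";
        System.out.println(outputPathFor(record, "part-00006")); // prints AD/part-00006
        System.out.println(reducerFor("AD", 4));                 // prints 3
    }
}
```

The two functions answer different questions: the first decides where a record is processed, the second decides where a finished record is written.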
MultipleOutputFormat is simple, but it's also limited. For example, we were able to
split the input data by row, but what if we want to split by column? Let's say we want to create
two data sets from the patent metadata: one containing time-related information (e.g.,
publication date) for each patent and another one containing geographical information
(e.g., country of invention). These two data sets may be of different output formats
and different data types for the keys and values. We can look to MultipleOutputs,
introduced in version 0.19 of Hadoop, for more powerful capabilities.
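The idea of splitting by column can be sketched in plain Java before bringing in any Hadoop classes. The example below is illustrative only: the class and method names, the output names "chrono" and "geo", and the choice of column positions are all assumptions (columns 1 and 2 are taken to be time-related, columns 4 and 5 geographic, with column 0 as the patent number shared by both data sets). In-memory lists stand in for the separate output files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ColumnSplitSketch {
    // Join the selected columns of one comma-separated record.
    static String project(String record, int... columns) {
        String[] fields = record.split(",", -1); // -1 keeps trailing empty fields
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(fields[columns[i]]);
        }
        return sb.toString();
    }

    // Send a projection of every record to each of two named outputs.
    // Assumed layout: 0 = patent number, 1-2 = time fields, 4-5 = geographic fields.
    static Map<String, List<String>> split(List<String> records) {
        Map<String, List<String>> outputs = new LinkedHashMap<>();
        outputs.put("chrono", new ArrayList<>());
        outputs.put("geo", new ArrayList<>());
        for (String record : records) {
            outputs.get("chrono").add(project(record, 0, 1, 2));
            outputs.get("geo").add(project(record, 0, 4, 5));
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        records.add("5765303,1998,14046,1996,\"AD\",\"\",,1,12,42,5,59,11,1,"
                + "0.4545,0,0,1,67.3636,,,,");
        Map<String, List<String>> outputs = split(records);
        System.out.println(outputs.get("chrono").get(0)); // prints 5765303,1998,14046
        System.out.println(outputs.get("geo").get(0));    // prints 5765303,"AD",""
    }
}
```

Unlike the row-based split, every input record contributes to both outputs here, which is why a single filename per record is not enough.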
The approach taken by MultipleOutputs is different from MultipleOutputFormat.
Rather than asking for the filename to output each record, MultipleOutputs creates
multiple OutputCollectors. Each OutputCollector can have its own OutputFormat
and types for the key/value pair. Your MapReduce program will decide what to output
to each OutputCollector. Listing 7.2 shows a program that takes our patent metadata