-rwxrwxrwx 1 Administrator None 0 Jul 31 06:28 part-00000
-rwxrwxrwx 1 Administrator None 0 Jul 31 06:28 part-00001
-rwxrwxrwx 1 Administrator None 0 Jul 31 06:29 part-00002
-rwxrwxrwx 1 Administrator None 0 Jul 31 06:29 part-00003
-rwxrwxrwx 1 Administrator None 0 Jul 31 06:29 part-00004
-rwxrwxrwx 1 Administrator None 0 Jul 31 06:29 part-00005
-rwxrwxrwx 1 Administrator None 0 Jul 31 06:29 part-00006
We have a set of files prefixed with chrono and another set prefixed with geo. Note that the program still created the default output files part-* even though it wrote nothing to them explicitly. It's entirely possible to write to these files using the original OutputCollector passed in through the map() method. In fact, if this were not a map-only program, records written to the original OutputCollector, and only those records, would be passed to the reducers for processing.
One of the trade-offs with MultipleOutputs is its rigid naming structure compared to MultipleOutputFormat. Your output collector's name cannot be part, because that name is already in use for the default output. The output filename is also strictly defined: the output collector's name, followed by m or r depending on whether the output was collected at the mapper or the reducer, followed finally by a partition number.
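To make the mechanics concrete, here's a minimal sketch of a map-only mapper using MultipleOutputs from the old org.apache.hadoop.mapred API. The class name and the assumed column positions in the patent records are illustrative, not taken from the original listing.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class ProjectionMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, Text> {

  private MultipleOutputs mos;

  public void configure(JobConf job) {
    // The named outputs "chrono" and "geo" must be registered on the
    // JobConf in the driver, e.g.:
    //   MultipleOutputs.addNamedOutput(job, "chrono",
    //       TextOutputFormat.class, NullWritable.class, Text.class);
    mos = new MultipleOutputs(job);
  }

  public void map(LongWritable key, Text value,
      OutputCollector<NullWritable, Text> output, Reporter reporter)
      throws IOException {
    String[] f = value.toString().split(",", -1);
    // Assumed column positions: 0=PATENT, 1=GYEAR, 2=GDATE,
    // 4=COUNTRY, 5=POSTATE
    mos.getCollector("chrono", reporter)
       .collect(NullWritable.get(), new Text(f[0] + "," + f[1] + "," + f[2]));
    mos.getCollector("geo", reporter)
       .collect(NullWritable.get(), new Text(f[0] + "," + f[4] + "," + f[5]));
    // Nothing is written to the original collector, which is why the
    // default part-* files in the listing above come out empty.
  }

  public void close() throws IOException {
    mos.close(); // flushes and closes all named output collectors
  }
}

Because all records are collected at the mapper, the named outputs land in files such as chrono-m-00000 and geo-m-00000, as the following listings show.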
head output/chrono-m-00000
"PATENT","GYEAR","GDATE"
3070801,1963,1096
3070802,1963,1096
3070803,1963,1096
3070804,1963,1096
3070805,1963,1096
3070806,1963,1096
3070807,1963,1096
3070808,1963,1096
3070809,1963,1096
head output/geo-m-00000
"PATENT","COUNTRY","POSTATE"
3070801,"BE",""
3070802,"US","TX"
3070803,"US","IL"
3070804,"US","OH"
3070805,"US","CA"
3070806,"US","PA"
3070807,"US","OH"
3070808,"US","IA"
3070809,"US","AZ"
Looking at the output files, we see that we've successfully projected the columns of the patent data set into distinct files.
7.4 Inputting from and outputting to a database
Although Hadoop is useful for processing large data sets, relational databases remain the workhorse of many data processing applications. Oftentimes Hadoop will need to interface with databases.
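As a taste of what that interfacing looks like, here's a minimal driver-side sketch using Hadoop's DBConfiguration and DBOutputFormat classes from the org.apache.hadoop.mapred.lib.db package. The JDBC driver, connection URL, credentials, and table/column names are placeholders, not values from the text.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;

public class DbOutputDriver {
  public static void main(String[] args) {
    JobConf job = new JobConf(DbOutputDriver.class);
    // Write job output to a relational table instead of HDFS files.
    job.setOutputFormat(DBOutputFormat.class);
    // Driver class, URL, and credentials below are placeholders.
    DBConfiguration.configureDB(job,
        "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/patents",
        "user", "password");
    // Each output value must implement DBWritable so its fields can be
    // bound to the named columns of the target table.
    DBOutputFormat.setOutput(job, "patent_table", "patent_id", "gyear");
    // ... set the mapper and input format/path, then run the job with
    // JobClient.runJob(job);
  }
}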
 