Advanced MapReduce - Hadoop in Action


2,C,32.00,30-Nov-2007

3,D,25.02,22-Jan-2009

If we want an inner join of the two data sets above, the desired output would look like listing 5.2.

Listing 5.2 Desired output of an inner join between Customers and Orders data

1,Stephanie Leung,555-555-5555,B,88.25,20-May-2008

2,Edward Kim,123-456-7890,C,32.00,30-Nov-2007

3,Jose Madriz,281-330-8004,A,12.95,02-Jun-2008

3,Jose Madriz,281-330-8004,D,25.02,22-Jan-2009
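
To make the expected result concrete, here is a minimal plain-Java sketch of the inner join itself, with no Hadoop involved. The class and method names (InnerJoinSketch, innerJoin) are hypothetical, the input records are reconstructed from Listing 5.2, and the sketch assumes the first comma-separated field of every record is the customer ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InnerJoinSketch {
    // Inner join on the first comma-separated field (assumed to be the customer ID).
    static List<String> innerJoin(List<String> customers, List<String> orders) {
        // Index the orders by customer ID, preserving input order within each ID.
        Map<String, List<String>> ordersById = new LinkedHashMap<>();
        for (String order : orders) {
            String[] parts = order.split(",", 2); // [customerId, rest of the record]
            ordersById.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }

        List<String> joined = new ArrayList<>();
        for (String customer : customers) {
            String id = customer.split(",", 2)[0];
            // A customer with no orders contributes nothing; orders with no
            // matching customer are never visited. That makes the join "inner".
            for (String rest : ordersById.getOrDefault(id, new ArrayList<>())) {
                joined.add(customer + "," + rest);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList(
                "1,Stephanie Leung,555-555-5555",
                "2,Edward Kim,123-456-7890",
                "3,Jose Madriz,281-330-8004");
        List<String> orders = Arrays.asList(
                "3,A,12.95,02-Jun-2008",
                "1,B,88.25,20-May-2008",
                "2,C,32.00,30-Nov-2007",
                "3,D,25.02,22-Jan-2009");
        // Prints one joined line per matching (customer, order) pair.
        innerJoin(customers, orders).forEach(System.out::println);
    }
}
```

This in-memory version only works when one data set fits in a hash map on a single machine; it is meant to pin down the semantics, not to show how Hadoop performs the join.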

Hadoop can also perform outer joins, although to simplify explanation we focus on inner joins.

5.2.1 Reduce-side joining

Hadoop has a contrib package called datajoin that works as a generic framework for data joining in Hadoop. Its jar file is at contrib/datajoin/hadoop-*-datajoin.jar. To distinguish it from other joining techniques, it's called the reduce-side join, as we do most of the processing on the reduce side. It's also known as the repartitioned join (or the repartitioned sort-merge join), as it's the same as the database technique of the same name. Although it's not the most efficient joining technique, it's the most general and forms the basis of some more advanced techniques (such as the semijoin).

Reduce-side join introduces some new terminology and concepts, namely data source, tag, and group key. A data source is analogous to a table in relational databases. We have two data sources in our toy example: Customers and Orders. A data source can be a single file or multiple files. The important point is that all the records in a data source have the same structure, analogous to a schema.

The MapReduce paradigm calls for processing each record one at a time in a stateless manner. If we want some state information to persist, we have to tag the record with such state. For example, given our two files, a record may look to a mapper like this:

3,Jose Madriz,281-330-8004

or:

3,A,12.95,02-Jun-2008

where the record type (Customers or Orders) is dissociated from the record itself. Tagging the record will ensure that specific metadata will always go along with the record. For the purpose of data joining, we want to tag each record with its data source.
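
The tagging idea can be sketched in plain Java, as a simulation rather than the datajoin package's actual API. The names TaggingSketch, TaggedRecord, tag, and groupByKey are hypothetical, and the sketch assumes the group key is the first comma-separated field of each record. Once records from both sources are mixed together under the same key, the tag is the only thing that says which source each one came from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TaggingSketch {
    // A record plus the metadata that must travel with it (hypothetical type).
    static final class TaggedRecord {
        final String tag;      // data source name, e.g. "Customers" or "Orders"
        final String groupKey; // assumed here to be the first field of the record
        final String line;     // the original record, unchanged

        TaggedRecord(String tag, String groupKey, String line) {
            this.tag = tag;
            this.groupKey = groupKey;
            this.line = line;
        }
    }

    // Map-side step: attach the data source tag to a single record.
    static TaggedRecord tag(String dataSource, String line) {
        String groupKey = line.split(",", 2)[0];
        return new TaggedRecord(dataSource, groupKey, line);
    }

    // Stand-in for the shuffle: collect tagged records under their group key.
    static Map<String, List<TaggedRecord>> groupByKey(List<TaggedRecord> records) {
        Map<String, List<TaggedRecord>> groups = new TreeMap<>();
        for (TaggedRecord r : records) {
            groups.computeIfAbsent(r.groupKey, k -> new ArrayList<>()).add(r);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<TaggedRecord> tagged = new ArrayList<>();
        tagged.add(tag("Customers", "3,Jose Madriz,281-330-8004"));
        tagged.add(tag("Orders", "3,A,12.95,02-Jun-2008"));
        tagged.add(tag("Orders", "3,D,25.02,22-Jan-2009"));

        // Each record in a group still carries the name of its data source.
        for (Map.Entry<String, List<TaggedRecord>> e : groupByKey(tagged).entrySet()) {
            for (TaggedRecord r : e.getValue()) {
                System.out.println(e.getKey() + " [" + r.tag + "] " + r.line);
            }
        }
    }
}
```

Keeping the original line untouched and carrying the tag alongside it means the downstream step can decide how to combine records without having to guess their type from their shape.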
