2,C,32.00,30-Nov-2007
3,D,25.02,22-Jan-2009
If we want an inner join of the two data sets above, the desired output would look like
listing 5.2.
Listing 5.2 Desired output of an inner join between Customers and Orders data
1,Stephanie Leung,555-555-5555,B,88.25,20-May-2008
2,Edward Kim,123-456-7890,C,32.00,30-Nov-2007
3,Jose Madriz,281-330-8004,A,12.95,02-Jun-2008
3,Jose Madriz,281-330-8004,D,25.02,22-Jan-2009
Hadoop can also perform outer joins, although to simplify explanation we focus on
inner joins.
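To make the inner-join semantics concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code. The input records and their order are an assumption: they are reconstructed from the rows of listing 5.2, and the actual input files may contain more records. The function name inner_join is hypothetical.

```python
from collections import defaultdict

# Assumption: records reconstructed from listing 5.2; the real files may hold more rows.
customers = [
    "1,Stephanie Leung,555-555-5555",
    "2,Edward Kim,123-456-7890",
    "3,Jose Madriz,281-330-8004",
]
orders = [
    "3,A,12.95,02-Jun-2008",
    "1,B,88.25,20-May-2008",
    "2,C,32.00,30-Nov-2007",
    "3,D,25.02,22-Jan-2009",
]


def inner_join(left, right):
    """Join two lists of CSV lines on their first field (the customer ID)."""
    # Index the right-hand records by key, dropping the key from the stored value.
    right_by_key = defaultdict(list)
    for line in right:
        key, rest = line.split(",", 1)
        right_by_key[key].append(rest)

    joined = []
    for line in left:
        key = line.split(",", 1)[0]
        # A left record with no matching right record yields nothing (inner join).
        for rest in right_by_key.get(key, []):
            joined.append(f"{line},{rest}")
    return joined


joined = inner_join(customers, orders)
for row in joined:
    print(row)
```

With these inputs the printed rows match listing 5.2. A customer without any order, or an order without a matching customer, would simply be absent from the result.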
5.2.1 Reduce-side joining
Hadoop has a contrib package called datajoin that works as a generic framework
for data joining in Hadoop. Its jar file is at contrib/datajoin/hadoop-*-datajoin.jar.
To distinguish it from other joining techniques, it's called the reduce-side join,
as we do most of the processing on the reduce side. It's also known as the
repartitioned join (or the repartitioned sort-merge join), as it's the same as the
database technique of the same name. Although it's not the most efficient joining
technique, it's the most general and forms the basis of some more advanced
techniques (such as the semijoin).
The reduce-side join introduces some new terminology and concepts, namely data
source, tag, and group key. A data source is analogous to a table in relational
databases. We have two data sources in our toy example: Customers and Orders. A data
source can be a single file or multiple files. The important point is that all the
records in a data source have the same structure, analogous to a schema.
The MapReduce paradigm calls for processing each record one at a time in a stateless
manner. If we want some state information to persist, we have to tag the record with
such state. For example, given our two files, a record may look to a mapper like this:
3,Jose Madriz,281-330-8004
or:
3,A,12.95,02-Jun-2008
where the record type (Customers or Orders) is dissociated from the record itself.
Tagging the record will ensure that specific metadata will always go along with the
record. For the purpose of data joining, we want to tag each record with its data source.
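The tagging idea can be sketched in plain Python. This is a single-process illustration, not the API of the datajoin package: the function names tag_record and reduce_join are hypothetical, and it assumes the first field of each record (the customer ID) is the key on which records are grouped.

```python
from collections import defaultdict


def tag_record(source, line):
    """Map step: emit (key, (tag, record)) so the data source travels with the record."""
    key = line.split(",", 1)[0]  # assumption: the first field is the join key
    return key, (source, line)


def reduce_join(tagged):
    """Reduce step: use the tags to pair Customers records with Orders records."""
    customers = [rec for tag, rec in tagged if tag == "Customers"]
    orders = [rec for tag, rec in tagged if tag == "Orders"]
    # Drop the duplicated key from the order before appending it to the customer.
    return [f"{c},{o.split(',', 1)[1]}" for c in customers for o in orders]


inputs = {
    "Customers": ["3,Jose Madriz,281-330-8004"],
    "Orders": ["3,A,12.95,02-Jun-2008", "3,D,25.02,22-Jan-2009"],
}

# Simulate the shuffle: collect all tagged records that share a key.
groups = defaultdict(list)
for source, lines in inputs.items():
    for line in lines:
        key, value = tag_record(source, line)
        groups[key].append(value)

result = [row for key in sorted(groups) for row in reduce_join(groups[key])]
for row in result:
    print(row)
```

Without the tag, the reduce step would receive three comma-separated strings for key 3 and would have no reliable way to tell which one is the customer and which are orders.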
 
 