Joining is the canonical example for processing data from multiple data sources.
Though Hadoop has a powerful datajoin package for doing arbitrary joins, its generality
comes at the expense of efficiency. A couple of other joining methods can provide faster
joins by exploiting the size asymmetry between data sources that is typical of most joins.
One of these methods leverages the Bloom filter, a data structure that's useful in many
data processing tasks.
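To make the Bloom filter concrete, here is a minimal, illustrative sketch in Java (it is not Hadoop's own org.apache.hadoop.util.bloom.BloomFilter; the class name, sizes, and hashing scheme are chosen for this example). A Bloom filter is a bit array probed by k hash functions: lookups can return false positives but never false negatives, which is exactly what makes it safe for pre-filtering join keys before the expensive join step.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: a bit array plus k hash probes.
// add() sets k bits for a key; mightContain() checks all k bits.
// A false answer means the key was definitely never added; a true
// answer means it was probably added (false positives are possible).
public class BloomFilterSketch {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int numHashes; // k, the number of hash probes per key

    public BloomFilterSketch(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th hash from two base hashes (the common
    // double-hashing trick, h1 + i*h2), reduced modulo the bit count.
    private int hash(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(hash(key, i));
        }
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(hash(key, i))) {
                return false; // definitely absent
            }
        }
        return true; // probably present
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("user-42");
        filter.add("user-7");
        System.out.println(filter.mightContain("user-42")); // true (no false negatives)
        System.out.println(filter.mightContain("user-7"));  // true
        System.out.println(filter.mightContain("user-99")); // almost certainly false
    }
}
```

In the join setting, one pass builds a filter over the keys of the smaller data source; mappers over the larger source then drop any record whose key fails mightContain(), shrinking the data shuffled to the reducers at the cost of a small false-positive rate.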
At this point, your knowledge of MapReduce programming should enable you to
start writing your own programs. As all programmers know, programming is more
than writing code. You also rely on various techniques and processes, from development
and deployment to testing and debugging. The nature of MapReduce programming and
distributed computing adds complexity and nuance to these processes, which we'll
cover in the next chapter.
5.6 Further resources
http://portal.acm.org/citation.cfm?doid=1247480.1247602 —MapReduce's lack of
simple support for joining datasets is well-known. Many of the tools to enhance
Hadoop (such as Pig, Hive, and CloudBase) offer data joins as a first-class
operation. For a more formal treatment, Hung-chih Yang and coauthors have
published a paper “Map-reduce-merge: simplified relational data processing
on large clusters” that proposes a modified form of MapReduce with an extra
“merge” step that supports data joining natively.
http://umiacs.umd.edu/~jimmylin/publications/Lin_etal_TR2009.pdf —Section
5.2.2 describes the use of distributed cache to provide side data to tasks.
The limitation of this technique is that the side data is replicated to every
TaskTracker, and the side data must fit into memory. Jimmy Lin and colleagues
explore the use of memcached, a distributed in-memory object caching system,
to provide global access to side data. Their experience is summarized in the
paper “Low-Latency, High-Throughput Access to Static Global Resources within
the Hadoop Framework.”
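The map-side (replicated) join pattern that the distributed cache enables can be sketched in plain Java, without the Hadoop API: the small side table is loaded whole into an in-memory hash map, and each record of the large table is joined against it with a lookup. In a real mapper the side table would be read from a distributed-cache file in the task's setup step; here the class name, the "key,value" record format, and the sample data are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a replicated (map-side) join: the small table lives
// entirely in memory as a HashMap, and the large table is streamed
// record by record, emitting only records whose key has a match.
public class ReplicatedJoinSketch {
    // bigTable records are "key,value" strings; sideTable maps key -> value.
    public static List<String> join(String[] bigTable, Map<String, String> sideTable) {
        List<String> out = new ArrayList<>();
        for (String record : bigTable) {
            String[] parts = record.split(",", 2);
            String match = sideTable.get(parts[0]); // O(1) in-memory lookup
            if (match != null) {
                out.add(parts[0] + "," + parts[1] + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Small side table: customer id -> name (fits in memory).
        Map<String, String> customers = new HashMap<>();
        customers.put("c1", "Alice");
        customers.put("c2", "Bob");

        // Large table: orders as "customerId,item" records.
        String[] orders = {"c1,book", "c3,pen", "c2,lamp"};

        System.out.println(join(orders, customers));
        // prints [c1,book,Alice, c2,lamp,Bob]
    }
}
```

The whole-table-in-memory requirement is exactly the limitation the memcached approach above addresses: once the side table is held in an external cache, it no longer has to fit in each task's heap.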