Joining is the canonical example for processing data from multiple data sources.
Though Hadoop has a powerful datajoin package for doing arbitrary joins, its generality
comes at the expense of efficiency. A couple of other joining methods can provide faster
joins by exploiting the size asymmetry between data sources that is typical of most joins.
One of these methods leverages the Bloom filter, a data structure that's useful in many
data processing tasks.
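To make the Bloom filter concrete, here is a minimal, illustrative sketch in Java (it is not Hadoop's own org.apache.hadoop.util.bloom.BloomFilter; the class name, sizes, and hashing scheme are chosen for this example). A Bloom filter is a bit array probed by k hash functions: lookups can return false positives but never false negatives, which is exactly what makes it safe for pre-filtering join keys before the expensive join step.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: a bit array plus k hash probes.
// add() sets k bits for a key; mightContain() checks all k bits.
// A false answer means the key was definitely never added; a true
// answer means it was probably added (false positives are possible).
public class BloomFilterSketch {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int numHashes; // k, the number of hash probes per key

    public BloomFilterSketch(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th hash from two base hashes (the common
    // double-hashing trick, h1 + i*h2), reduced modulo the bit count.
    private int hash(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(hash(key, i));
        }
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(hash(key, i))) {
                return false; // definitely absent
            }
        }
        return true; // probably present
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("user-42");
        filter.add("user-7");
        System.out.println(filter.mightContain("user-42")); // true (no false negatives)
        System.out.println(filter.mightContain("user-7"));  // true
        System.out.println(filter.mightContain("user-99")); // almost certainly false
    }
}
```

In the join setting, one pass builds a filter over the keys of the smaller data source; mappers over the larger source then drop any record whose key fails mightContain(), shrinking the data shuffled to the reducers at the cost of a small false-positive rate.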
At this point, your knowledge of MapReduce programming should enable you to
start writing your own programs. As all programmers know, programming is more
than writing code. You also rely on various techniques and processes, from development
and deployment to testing and debugging. The nature of MapReduce programming and
distributed computing adds complexity and nuance to these processes, which we'll
cover in the next chapter.
5.6 Further resources
http://portal.acm.org/citation.cfm?doid=1247480.1247602 —MapReduce's lack of
simple support for joining datasets is well-known. Many of the tools to enhance
Hadoop (such as Pig, Hive, and CloudBase) offer data joins as a first-class
operation. For a more formal treatment, Hung-chih Yang and coauthors have
published a paper “Map-reduce-merge: simplified relational data processing
on large clusters” that proposes a modified form of MapReduce with an extra
“merge” step that supports data joining natively.
http://umiacs.umd.edu/~jimmylin/publications/Lin_etal_TR2009.pdf —Section
5.2.2 describes the use of distributed cache to provide side data to tasks.
The limitation of this technique is that the side data is replicated to every
TaskTracker, and the side data must fit into memory. Jimmy Lin and colleagues
explore the use of memcached, a distributed in-memory object caching system,
to provide global access to side data. Their experience is summarized in the
paper “Low-Latency, High-Throughput Access to Static Global Resources within
the Hadoop Framework.”
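The map-side (replicated) join pattern that the distributed cache enables can be sketched in plain Java, without the Hadoop API: the small side table is loaded whole into an in-memory hash map, and each record of the large table is joined against it with a lookup. In a real mapper the side table would be read from a distributed-cache file in the task's setup step; here the class name, the "key,value" record format, and the sample data are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a replicated (map-side) join: the small table lives
// entirely in memory as a HashMap, and the large table is streamed
// record by record, emitting only records whose key has a match.
public class ReplicatedJoinSketch {
    // bigTable records are "key,value" strings; sideTable maps key -> value.
    public static List<String> join(String[] bigTable, Map<String, String> sideTable) {
        List<String> out = new ArrayList<>();
        for (String record : bigTable) {
            String[] parts = record.split(",", 2);
            String match = sideTable.get(parts[0]); // O(1) in-memory lookup
            if (match != null) {
                out.add(parts[0] + "," + parts[1] + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Small side table: customer id -> name (fits in memory).
        Map<String, String> customers = new HashMap<>();
        customers.put("c1", "Alice");
        customers.put("c2", "Bob");

        // Large table: orders as "customerId,item" records.
        String[] orders = {"c1,book", "c3,pen", "c2,lamp"};

        System.out.println(join(orders, customers));
        // prints [c1,book,Alice, c2,lamp,Bob]
    }
}
```

The whole-table-in-memory requirement is exactly the limitation the memcached approach above addresses: once the side table is held in an external cache, it no longer has to fit in each task's heap.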