New Data Warehouse Technologies - Data Warehouse Systems: Design and Implementation

Database Reference

In-Depth Information

p1, 100

p2, 200

p1, 200

p1, 300

...

(p1, 100)

(p2, 200)

(p1, 200)

(p1, 300)

...

(p1, [100, 200,

300, 300])

(p2, [200, 200])

(p3, [300])

(p4, [200, 300])

...

(p1, 300)

(p2, 200)

(p3, 300)

(p4, 300)

...

p1, 300

p2, 200

p4, 200

p3, 300

...

(p1, 300)

(p2, 200)

(p4, 200)

(p3, 300)

...

p2, 200

p3, 500

p1, 200

p4, 100

...

(p2, 200)

(p3, 500)

(p1, 200)

(p4, 100)

...

(p1, [200])

(p2, [200])

(p3, [500])

(p4, [100])

...

(p1, 200)

(p2, 200)

(p3, 500)

(p4, 100)

...

Reduce with

function MAX

Map

Shuffle

Fig. 13.1 A MapReduce process for products

The job tracker sends a Map task or a Reduce task to a task tracker for

execution. The task trackers, based on the job ID, retrieve the job resources

from the distributed file system.

Finally, the task trackers launch a Java virtual machine with a child process

which runs the Map or Reduce code.

Figure 13.1 shows an example of how MapReduce works. Consider that

orders in the Northwind database come from many sources, each from one

country. We are interested in analyzing product sales. The files in this example

contain pairs of the form ( ProductKey , Quantity ). In a Map phase, the input

is divided into a list of key-value pairs with the ProductKey as a key and

the Quantity as a value. This list is then sent as an input to the so-called

Shue phase in which it is sorted such that values with the same ProductKey

are put together. The output from the shue phase is a collection of tuples

of the form (key, list-of-values) . This is forwarded into a Reduce phase where

many different operations like average, sum, or count can be performed. Since

the key-list pairs are independent from each other, they can be forwarded to

multiple task trackers for parallel execution.

The following table summarizes the format of the input and output of the

phases of a MapReduce process:

Input Output

Map (k1,v1) (List(k2,v2))

Shu e (List(k2,v2)) (k2,List(v2))

Reduce (k2,List(v2)) (List(k3,v3))

Data Warehouse Systems: Design and Implementation

Search WWH ::

Custom Search

Home