Database Reference
In-Depth Information
p1, 100
p2, 200
p1, 200
p1, 300
...
(p1, 100)
(p2, 200)
(p1, 200)
(p1, 300)
...
(p1, [100, 200,
300, 300])
(p2, [200, 200])
(p3, [300])
(p4, [200, 300])
...
(p1, 300)
(p2, 200)
(p3, 300)
(p4, 300)
...
p1, 300
p2, 200
p4, 200
p3, 300
...
(p1, 300)
(p2, 200)
(p4, 200)
(p3, 300)
...
p2, 200
p3, 500
p1, 200
p4, 100
...
(p2, 200)
(p3, 500)
(p1, 200)
(p4, 100)
...
(p1, [200])
(p2, [200])
(p3, [500])
(p4, [100])
...
(p1, 200)
(p2, 200)
(p3, 500)
(p4, 100)
...
Reduce with
function MAX
Map
Shuffle
Fig. 13.1 A MapReduce process for products
￿ The job tracker sends a Map task or a Reduce task to a task tracker for
execution. The task trackers, based on the job ID, retrieve the job resources
from the distributed file system.
￿ Finally, the task trackers launch a Java virtual machine with a child process
which runs the Map or Reduce code.
Figure 13.1 shows an example of how MapReduce works. Consider that
orders in the Northwind database come from many sources, each from one
country. We are interested in analyzing product sales. The files in this example
contain pairs of the form ( ProductKey , Quantity ). In a Map phase, the input
is divided into a list of key-value pairs with the ProductKey as a key and
the Quantity as a value. This list is then sent as an input to the so-called
Shue phase in which it is sorted such that values with the same ProductKey
are put together. The output from the shue phase is a collection of tuples
of the form (key, list-of-values) . This is forwarded into a Reduce phase where
many different operations like average, sum, or count can be performed. Since
the key-list pairs are independent from each other, they can be forwarded to
multiple task trackers for parallel execution.
The following table summarizes the format of the input and output of the
phases of a MapReduce process:
Input Output
Map (k1,v1) (List(k2,v2))
Shu e (List(k2,v2)) (k2,List(v2))
Reduce (k2,List(v2)) (List(k3,v3))
 
Search WWH ::




Custom Search