warehousing paradigms are built on the technologies that we study in the
first part of the chapter.
13.1 MapReduce and Hadoop
MapReduce is a processing framework originally developed by Google to support
large-scale data processing for its web search engine on very large numbers of commodity machines.
MapReduce can be implemented in many languages over many data formats.
It works on the concept of divide and conquer, breaking a task into smaller
chunks and processing them in parallel over a collection of identical machines
(a cluster). Data at each node are typically stored in the local file system,
although several extensions, such as HadoopDB, also support data stored in
database management systems (DBMSs). A MapReduce program consists of
two phases, namely Map and Reduce, which run in parallel on the clustered
commodity servers, as we will see below.
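To make the two phases concrete, the following is a sketch of the canonical word-count program written against Hadoop's Java MapReduce API: the Map phase emits a (word, 1) pair for every word in its input split, and the Reduce phase sums the counts emitted for each word. Class names and the input/output paths passed on the command line are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each mapper processes one split of the input in parallel, and each reducer receives all the pairs sharing the same word; this is how the divide-and-conquer strategy described above is realized.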
Among the many MapReduce implementations, the most popular one is
Hadoop, an open-source framework written in Java. It has the capability
to handle structured, unstructured, or semistructured data using commodity
hardware, dividing a task into parallel chunks of jobs and data. Hadoop
runs on its distributed file system (HDFS) but can also read and write
other file systems. Hadoop uses blocks (typically of 128 MB) to store files
on the file system; one Hadoop block may consist of many blocks of the
underlying operating system. Moreover, blocks can be replicated on several
different nodes: for example, block1 can be stored on node1 and node3, block2
on node2 and node4, and so on (a small sketch for inspecting block locations
is given after the list below). There are two main pieces of software that
handle MapReduce jobs:
• The job tracker receives all the jobs from clients, schedules the Map
and Reduce tasks to appropriate task trackers, monitors failing tasks, and
reschedules them to different task tracker nodes. One job tracker exists in
each Hadoop cluster.
• The task trackers are the modules that execute the tasks of a job. There are many
task trackers in a Hadoop cluster to manage parallelism in Map and Reduce
tasks. They continuously send messages to the job tracker, both to signal
that they are alive and to ask for work.
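Returning to the block layout described above, the following is a minimal sketch, assuming the standard HDFS Java client API, of how a program can ask HDFS where each block of a file and its replicas are stored; the path passed on the command line is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    // Connect to the file system configured in core-site.xml (typically HDFS).
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path(args[0]);
    FileStatus status = fs.getFileStatus(file);
    // Retrieve the blocks that make up the file and the nodes holding each replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d replicas on: %s%n",
          block.getOffset(), block.getLength(),
          String.join(", ", block.getHosts()));
    }
  }
}

Running this against a large file would show, for each 128 MB block, the set of nodes (e.g., node1 and node3) that hold a replica of it.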
The process and elements involved in a MapReduce job can be succinctly
described as follows:
• The MapReduce program tells a job client to run a MapReduce job. The
job client sends a message to a job tracker and gets an ID for the job.
• The job client copies the job resources (e.g., a .jar file) to the shared file
system, usually HDFS.
• The job client sends a request to the job tracker to start the job. The job
tracker computes how to split the input data so that it can send each chunk
of work to a different mapper process to maximize throughput (see the driver
sketch after this list).
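As a rough illustration of the steps above from the job client's perspective, the following is a minimal sketch, assuming the Hadoop Java API, of a driver that identifies the job's .jar resource, bounds the size of the input splits, and submits the job to the cluster. The class name, paths, and split size are illustrative, and in practice the mapper and reducer classes would be set as in the word-count sketch earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "submission example");
    // The .jar containing this class is the job resource copied to the shared
    // file system (usually HDFS) on submission.
    job.setJarByClass(SubmitJob.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Each input split is handed to one mapper; capping the split size at the
    // HDFS block size (128 MB here) keeps roughly one mapper per block.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    // submit() returns without waiting for the job to finish and an ID is
    // assigned to the job; waitForCompletion(true) would instead block and
    // report progress.
    job.submit();
    System.out.println("Submitted " + job.getJobID());
  }
}

From this point on, the job tracker takes over, assigning the resulting Map and Reduce tasks to the task trackers described above.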