BOOM-MR re-implements the MapReduce framework by replacing Hadoop's
core scheduling logic with Overlog. The JobTracker tracks the ongoing status of the
system, along with transient state in the form of messages sent and received by the
JobTracker, by capturing this information in four Overlog tables: job, task, taskAttempt,
and taskTracker. The job relation contains a single row for each job submitted to the
JobTracker. The task relation identifies each task within a job. The attributes of this
relation identify the task type (map or reduce), the input partition (a chunk for map
tasks, a bucket for reduce tasks), and the current running status. The taskAttempt
relation maintains the state of each task attempt (a task may be attempted more than
once due to speculation or if the initial execution attempt failed). The taskTracker
relation identifies each TaskTracker in the cluster by a unique name. Overlog rules
update the JobTracker's tables by converting inbound messages into tuples of these
four relations. Scheduling decisions are encoded in the taskAttempt table, which
assigns tasks to TaskTrackers. A scheduling policy is simply a set of rules that join
against the taskTracker relation to find TaskTrackers with unassigned slots and
schedule tasks by inserting tuples into taskAttempt. This architecture makes it easy
to define new scheduling policies.
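The join-based scheduling described above can be sketched as follows. This is an illustrative Python simulation, not Overlog: the relation names (task, taskTracker, taskAttempt) mirror the tables in the text, but the field names and slot bookkeeping are invented for the example.

```python
# Illustrative simulation of the Overlog-style scheduling join.
# Relation names mirror the tables described in the text; field names
# and slot accounting are invented for this sketch.

task = [
    {"job": "j1", "task": "t1", "type": "map",    "status": "pending"},
    {"job": "j1", "task": "t2", "type": "reduce", "status": "pending"},
]
taskTracker = [
    {"name": "tt1", "free_map_slots": 1, "free_reduce_slots": 0},
    {"name": "tt2", "free_map_slots": 0, "free_reduce_slots": 1},
]
taskAttempt = []

# A scheduling policy is a join: each pending task is matched against
# a tracker with a free slot of the right type, and every match becomes
# a new tuple inserted into taskAttempt.
for t in task:
    if t["status"] != "pending":
        continue
    slot = "free_map_slots" if t["type"] == "map" else "free_reduce_slots"
    for tt in taskTracker:
        if tt[slot] > 0:
            taskAttempt.append({"job": t["job"], "task": t["task"],
                                "tracker": tt["name"]})
            tt[slot] -= 1
            t["status"] = "assigned"
            break

print(taskAttempt)
```

Swapping in a different policy (e.g., locality-aware placement) only means changing the join conditions, which is exactly why new policies are easy to define in this architecture.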
2.6.6 Hyracks/ASTERIX
Hyracks is presented as a partitioned-parallel dataflow execution platform that runs on
shared-nothing clusters of computers [23]. Large collections of data items are stored
as local partitions that are distributed across the nodes of the cluster. A Hyracks job,
submitted by a client, processes one or more collections of data to produce one
or more output collections (also in the form of partitions). Hyracks provides a programming model and
an accompanying infrastructure to efficiently divide computations on large data
collections (spanning multiple machines) into computations that work on each partition of
the data separately. Every Hyracks cluster is managed by a Cluster Controller process.
The Cluster Controller accepts job execution requests from clients, plans their evaluation
strategies, and then schedules the jobs' tasks to run on selected machines in the cluster.
In addition, it is responsible for monitoring the state of the cluster to keep track of the
resource loads at the various worker machines. The Cluster Controller is also responsible
for re-planning and re-executing some or all of the tasks of a job in the event of a failure.
On the task execution side, each worker machine that participates in a Hyracks cluster
runs a Node Controller process. The Node Controller accepts task execution requests
from the Cluster Controller and also reports on its health via a heartbeat mechanism.
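A minimal sketch of the heartbeat-based health tracking on the Cluster Controller side might look like the following. This is a hypothetical simplification: the real Hyracks protocol carries richer state (resource loads, task progress) over its own RPC layer, and the class and timeout below are invented for illustration.

```python
import time

# Hypothetical sketch of heartbeat bookkeeping on the Cluster Controller
# side; Hyracks' actual protocol and data structures differ.
HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a node is considered failed

class ClusterController:
    def __init__(self):
        self.last_seen = {}  # node id -> timestamp of last heartbeat

    def on_heartbeat(self, node_id, now=None):
        # Called when a Node Controller reports in.
        self.last_seen[node_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        # Nodes whose last heartbeat is older than the timeout; these are
        # the candidates for re-planning and re-executing tasks.
        now = now if now is not None else time.time()
        return [n for n, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

cc = ClusterController()
cc.on_heartbeat("node1", now=100.0)
cc.on_heartbeat("node2", now=105.0)
print(cc.dead_nodes(now=112.0))  # node1 has been silent for 12 s > timeout
```

The point of the sketch is the division of labor: Node Controllers push liveness information, and the Cluster Controller turns missing heartbeats into failure-recovery decisions.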
In Hyracks, data flows between operators over connectors in the form of records
that can have an arbitrary number of fields. Hyracks provides support for
expressing data-type-specific operations such as comparisons and hash functions. The way
Hyracks uses a record as the carrier of data is a generalization of the (key, value)
concept of MapReduce. Hyracks strongly promotes the construction of reusable
operators and connectors that end users can use to build their jobs. The basic set of
Hyracks operators includes the following:
- File Readers/Writers operators are used to read and write files in various
formats from/to local file systems and the HDFS.
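The record-versus-(key, value) point can be illustrated with a small sketch. Python dictionaries stand in here for Hyracks' record model, and all field names are invented; this is not the actual Hyracks Java API.

```python
# A MapReduce pair is a record with exactly two fields: key and value.
kv_pair = ("user42", 17)

# A Hyracks-style record may carry any number of fields, so the
# (key, value) pair is just the two-field special case.
record = {"user_id": "user42", "clicks": 17, "country": "DE"}

# Data-type-specific operations (comparators, hash functions) are then
# supplied per field rather than assumed for a single designated key.
def hash_partition(rec, field, n_partitions):
    # Route a record to one of n partitions by hashing a chosen field,
    # as a connector distributing data between operators might.
    return hash(rec[field]) % n_partitions

part = hash_partition(record, "user_id", 4)
assert 0 <= part < 4
```

Because any field (or combination of fields) can drive partitioning, comparison, or hashing, operators and connectors built this way are reusable across jobs with different record layouts.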