BOOM-MR re-implements the MapReduce framework by replacing Hadoop's
core scheduling logic with Overlog. The JobTracker tracks the ongoing status of the
system, along with transient state in the form of messages sent and received by the
JobTracker, by capturing this information in four Overlog tables: job, task, taskAttempt,
and taskTracker. The job relation contains a single row for each job submitted to the
JobTracker. The task relation identifies each task within a job. The attributes of this
relation identify the task type (map or reduce), the input partition (a chunk for map
tasks, a bucket for reduce tasks), and the current running status. The taskAttempt
relation maintains the state of each task attempt (a task may be attempted more than
once due to speculation or if the initial execution attempt failed). The taskTracker
relation identifies each TaskTracker in the cluster by a unique name. Overlog rules
update the JobTracker's tables by converting inbound messages into tuples of these
four relations. Scheduling decisions are encoded in the taskAttempt table, which
assigns tasks to TaskTrackers. A scheduling policy is simply a set of rules that join
against the taskTracker relation to find TaskTrackers with unassigned slots and
schedule tasks by inserting tuples into taskAttempt. This architecture makes it easy
to define new scheduling policies.
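The join-based scheduling described above can be sketched as follows. This is an illustrative Python simulation, not Overlog: the relation names (task, taskTracker, taskAttempt) mirror the tables in the text, but the field names and slot bookkeeping are invented for the example.

```python
# Illustrative simulation of the Overlog-style scheduling join.
# Relation names mirror the tables described in the text; field names
# and slot accounting are invented for this sketch.

task = [
    {"job": "j1", "task": "t1", "type": "map",    "status": "pending"},
    {"job": "j1", "task": "t2", "type": "reduce", "status": "pending"},
]
taskTracker = [
    {"name": "tt1", "free_map_slots": 1, "free_reduce_slots": 0},
    {"name": "tt2", "free_map_slots": 0, "free_reduce_slots": 1},
]
taskAttempt = []

# A scheduling policy is a join: each pending task is matched against
# a tracker with a free slot of the right type, and every match becomes
# a new tuple inserted into taskAttempt.
for t in task:
    if t["status"] != "pending":
        continue
    slot = "free_map_slots" if t["type"] == "map" else "free_reduce_slots"
    for tt in taskTracker:
        if tt[slot] > 0:
            taskAttempt.append({"job": t["job"], "task": t["task"],
                                "tracker": tt["name"]})
            tt[slot] -= 1
            t["status"] = "assigned"
            break

print(taskAttempt)
```

Swapping in a different policy (e.g., locality-aware placement) only means changing the join conditions, which is exactly why new policies are easy to define in this architecture.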
2.6.6 Hyracks/ASTERIX
Hyracks is presented as a partitioned-parallel dataflow execution platform that runs on
shared-nothing clusters of computers [23]. Large collections of data items are stored
as local partitions that are distributed across the nodes of the cluster. A Hyracks job,
submitted by a client, processes one or more collections of data to produce one
or more output collections (also in the form of partitions). Hyracks provides a programming model and
an accompanying infrastructure to efficiently divide computations on large data
collections (spanning multiple machines) into computations that work on each partition of
the data separately. Every Hyracks cluster is managed by a Cluster Controller process.
The Cluster Controller accepts job execution requests from clients, plans their evaluation
strategies, and then schedules the jobs' tasks to run on selected machines in the cluster.
In addition, it is responsible for monitoring the state of the cluster to keep track of the
resource loads at the various worker machines. The Cluster Controller is also responsible
for re-planning and re-executing some or all of the tasks of a job in the event of a failure.
On the task execution side, each worker machine that participates in a Hyracks cluster
runs a Node Controller process. The Node Controller accepts task execution requests
from the Cluster Controller and also reports on its health via a heartbeat mechanism.
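A minimal sketch of the heartbeat-based health tracking on the Cluster Controller side might look like the following. This is a hypothetical simplification: the real Hyracks protocol carries richer state (resource loads, task progress) over its own RPC layer, and the class and timeout below are invented for illustration.

```python
import time

# Hypothetical sketch of heartbeat bookkeeping on the Cluster Controller
# side; Hyracks' actual protocol and data structures differ.
HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a node is considered failed

class ClusterController:
    def __init__(self):
        self.last_seen = {}  # node id -> timestamp of last heartbeat

    def on_heartbeat(self, node_id, now=None):
        # Called when a Node Controller reports in.
        self.last_seen[node_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        # Nodes whose last heartbeat is older than the timeout; these are
        # the candidates for re-planning and re-executing tasks.
        now = now if now is not None else time.time()
        return [n for n, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

cc = ClusterController()
cc.on_heartbeat("node1", now=100.0)
cc.on_heartbeat("node2", now=105.0)
print(cc.dead_nodes(now=112.0))  # node1 has been silent for 12 s > timeout
```

The point of the sketch is the division of labor: Node Controllers push liveness information, and the Cluster Controller turns missing heartbeats into failure-recovery decisions.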
In Hyracks, data flows between operators over connectors in the form of records
that can have an arbitrary number of fields. Hyracks provides support for
expressing data-type-specific operations such as comparisons and hash functions. The way
Hyracks uses a record as the carrier of data is a generalization of the (key, value)
concept of MapReduce. Hyracks strongly promotes the construction of reusable
operators and connectors that end users can use to build their jobs. The basic set of
Hyracks operators includes the following:
- File Readers/Writers operators are used to read and write files in various
formats from/to local file systems and the HDFS.
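The record-versus-(key, value) point can be illustrated with a small sketch. Python dictionaries stand in here for Hyracks' record model, and all field names are invented; this is not the actual Hyracks Java API.

```python
# A MapReduce pair is a record with exactly two fields: key and value.
kv_pair = ("user42", 17)

# A Hyracks-style record may carry any number of fields, so the
# (key, value) pair is just the two-field special case.
record = {"user_id": "user42", "clicks": 17, "country": "DE"}

# Data-type-specific operations (comparators, hash functions) are then
# supplied per field rather than assumed for a single designated key.
def hash_partition(rec, field, n_partitions):
    # Route a record to one of n partitions by hashing a chosen field,
    # as a connector distributing data between operators might.
    return hash(rec[field]) % n_partitions

part = hash_partition(record, "user_id", 4)
assert 0 <= part < 4
```

Because any field (or combination of fields) can drive partitioning, comparison, or hashing, operators and connectors built this way are reusable across jobs with different record layouts.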