Database Reference
In-Depth Information
Chapter 5
Scheduling and Workflow
When you're working with big data in a distributed, parallel processing environment like Hadoop, job scheduling
and workflow management are vital for efficient operation. Schedulers enable you to share resources at a job level
within Hadoop; in the first half of this chapter, I use practical examples to guide you in installing, configuring, and
using the Fair and Capacity schedulers for Hadoop V1 and V2. Additionally, at a higher level, workflow tools enable
you to manage the relationships between jobs. For instance, a workflow might include jobs that source, clean,
process, and output a data source. Each job runs in sequence, with the output from one forming the input for the
next. So, in the second half of this chapter, I demonstrate how workflow tools like Oozie offer the ability to manage
these relationships.
An Overview of Scheduling
The default Hadoop scheduler was FIFO—first in, first out. It did not support task preemption, which is the ability to
temporarily halt a task and allow another task to access those resources. Apache, though, offers two extra schedulers
for Hadoop—Capacity and Fair—that you can use in place of the default.
Although both schedulers are available in Hadoop V1 and V2, the functions they offer depend on the Hadoop
version you're using. To decide which scheduler is right for your applications, take a look at the prime features of each,
detailed in the following sections, or consult the Apache Software Foundation website at hadoop.apache.org/docs for
in-depth information.
The Capacity Scheduler
Capacity handles large clusters that are shared among multiple organizations or groups. In this multi-tenancy
environment, a cluster can have many job types and multiple job priorities. Some key features of Capacity are as
follows:
Organization : As it is designed for situations in which clusters need to support multi-tenancy, its resource
sharing is more stringent so as to meet capacity, security, and resource guarantees.
Capacity : Resources are allocated to queues and are shared among the jobs on that queue. It is possible to set
soft and hard limits on queue-based resources.
Security : In a multi-tenancy cluster, security is a major concern. Capacity uses access control lists (ACLs) to
manage queue-based job access. It also permits per-queue administration, so that you can have different settings on
the queues.
Elasticity : Free resources from under-utilized queues can be assigned to queues that have reached their
capacities. When needed elsewhere, these resources can then be reassigned, thereby maximizing utilization.
Multi-tenancy : In a multi-tenancy environment, a single user's rogue job could possibly soak up multiple
tenants' resources, which would have a serious impact on job-based service-level agreements (SLAs). The Capacity
scheduler provides a range of limits for these multiple jobs, users, and queues so as to avoid this problem.
 
Search WWH ::




Custom Search