Scheduling and Workflow - Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset

Chapter 5

Scheduling and Workflow

When you're working with big data in a distributed, parallel processing environment like Hadoop, job scheduling and workflow management are vital for efficient operation. Schedulers enable you to share resources at a job level within Hadoop; in the first half of this chapter, I use practical examples to guide you in installing, configuring, and using the Fair and Capacity schedulers for Hadoop V1 and V2. At a higher level, workflow tools enable you to manage the relationships between jobs. For instance, a workflow might include jobs that source, clean, process, and output data. Each job runs in sequence, with the output from one forming the input for the next. In the second half of this chapter, I demonstrate how workflow tools like Oozie manage these relationships.
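
The idea of jobs running in sequence, each consuming the previous job's output, can be sketched in a few lines of Python. This is only an illustration of the chaining concept, not how Oozie itself is configured or run; the stage functions (source, clean, process, output), the sample records, and the run_workflow helper are all hypothetical names invented for this sketch.

```python
from typing import Any, Callable, List


def source() -> List[str]:
    """Hypothetical 'source' job: produce raw records."""
    return [" Alice,30 ", "bob,25", "", "Carol,41"]


def clean(records: List[str]) -> List[str]:
    """Hypothetical 'clean' job: trim whitespace and drop empty records."""
    return [r.strip() for r in records if r.strip()]


def process(records: List[str]) -> List[dict]:
    """Hypothetical 'process' job: parse each record into typed fields."""
    rows = []
    for record in records:
        name, age = record.split(",")
        rows.append({"name": name.title(), "age": int(age)})
    return rows


def output(rows: List[dict]) -> List[str]:
    """Hypothetical 'output' job: format rows for the final data set."""
    return [f"{row['name']}: {row['age']}" for row in rows]


def run_workflow(first: Callable[[], Any], stages: List[Callable[[Any], Any]]) -> Any:
    """Run the stages in order, feeding each one the previous stage's output."""
    data = first()
    for stage in stages:
        data = stage(data)
    return data


result = run_workflow(source, [clean, process, output])
print(result)
```

A workflow tool adds what this sketch leaves out: it tracks the state of each job and decides what happens when a stage fails.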

An Overview of Scheduling

The default Hadoop scheduler was FIFO (first in, first out). It did not support task preemption, which is the ability to temporarily halt a task and allow another task to access those resources. Apache, though, offers two extra schedulers for Hadoop, Capacity and Fair, that you can use in place of the default.
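
To see why the lack of preemption matters, consider a toy model of FIFO scheduling. The sketch below is not Hadoop code. It assumes a single execution slot, that all jobs are submitted at time zero, and that each job has a known duration; the fifo_schedule function and the job names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple


def fifo_schedule(jobs: List[Tuple[str, int]]) -> Dict[str, int]:
    """Return the finish time of each job under strict FIFO ordering.

    jobs is a list of (name, duration) pairs in submission order.
    Each job runs to completion before the next one starts, because
    this model has no preemption.
    """
    queue = deque(jobs)
    clock = 0
    finish_times = {}
    while queue:
        name, duration = queue.popleft()
        clock += duration  # the running job cannot be halted
        finish_times[name] = clock
    return finish_times


# A short job submitted behind a long one must wait for it to finish.
finish_times = fifo_schedule([("long_etl", 100), ("quick_report", 5)])
print(finish_times)
```

In this model the five-unit job finishes at time 105 because it cannot interrupt the job ahead of it. A scheduler that supports preemption could pause the long job and let the short one through.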

Although both schedulers are available in Hadoop V1 and V2, the functions they offer depend on the Hadoop version you're using. To decide which scheduler is right for your applications, take a look at the prime features of each, detailed in the following sections, or consult the Apache Software Foundation website at hadoop.apache.org/docs for in-depth information.

The Capacity Scheduler

Capacity handles large clusters that are shared among multiple organizations or groups. In this multi-tenancy environment, a cluster can have many job types and multiple job priorities. Some key features of Capacity are as follows:

Organization: Because Capacity is designed for clusters that need to support multi-tenancy, its resource sharing is more stringent so as to meet capacity, security, and resource guarantees.

Capacity: Resources are allocated to queues and are shared among the jobs on each queue. It is possible to set soft and hard limits on queue-based resources.

Security: In a multi-tenancy cluster, security is a major concern. Capacity uses access control lists (ACLs) to manage queue-based job access. It also permits per-queue administration, so that you can have different settings on different queues.

Elasticity: Free resources from under-utilized queues can be assigned to queues that have reached their capacities. When those resources are needed elsewhere, they can be reassigned, thereby maximizing utilization.

Multi-tenancy: In a multi-tenancy environment, a single user's rogue job could soak up multiple tenants' resources, which would have a serious impact on job-based service-level agreements (SLAs). The Capacity scheduler provides a range of limits on jobs, users, and queues to avoid this problem.
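
The interplay of soft limits, hard limits, and elasticity described above can be illustrated with a small model. This is a simplified sketch, not the Capacity scheduler's actual algorithm. It assumes resources are counted in whole slots, that the guaranteed shares add up to no more than the cluster total, and that spare slots are lent out in queue order; the TenantQueue class, the allocate function, and the queue names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TenantQueue:
    name: str
    guaranteed: int  # slots the queue is entitled to (soft limit)
    maximum: int     # slots the queue may never exceed (hard limit)
    demand: int      # slots its pending jobs are asking for


def allocate(queues: List[TenantQueue], total_slots: int) -> Dict[str, int]:
    """Allocate slots: guarantees first, then lend spare slots elastically."""
    # Step 1: every queue receives its demand, up to its guaranteed share.
    allocation = {q.name: min(q.demand, q.guaranteed) for q in queues}
    spare = total_slots - sum(allocation.values())

    # Step 2: spare slots go to queues that still want more,
    # but never beyond a queue's hard limit.
    for q in queues:
        wanted = min(q.demand, q.maximum) - allocation[q.name]
        extra = max(0, min(wanted, spare))
        allocation[q.name] += extra
        spare -= extra
    return allocation


# The etl queue is under-utilized, so adhoc borrows up to its hard limit.
quiet = allocate(
    [TenantQueue("etl", 60, 80, 20), TenantQueue("adhoc", 40, 70, 90)], 100
)

# When etl needs its full share again, adhoc falls back to its guarantee.
busy = allocate(
    [TenantQueue("etl", 60, 80, 60), TenantQueue("adhoc", 40, 70, 90)], 100
)
print(quiet, busy)
```

In the first call the adhoc queue grows from 40 to 70 slots by borrowing idle capacity, and the hard limit stops it from taking more. In the second call the borrowed slots return to etl. A real scheduler reclaims capacity as tasks complete or are preempted, which this model does not attempt to show.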
