Foundation (which operates many of the other Hadoop-related projects), many people have
successfully used it in deployments.
Why virtualize Hadoop at all? Historically, Hadoop clusters have run on commodity servers
(i.e., Intel x86 machines with their own set of disks, running the Linux OS). When scheduling
jobs, Hadoop uses the location of data in HDFS (described here) to run code as close to the
data as possible, preferably on the same node, to minimize the amount of data transferred
across the network. In many virtualized environments, directly attached storage is replaced
by a shared storage device such as a storage area network (SAN) or network-attached storage
(NAS). In these environments, there is no notion of storage locality.
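To make the locality idea concrete, here is a minimal sketch, in Python, of locality-aware task placement. This is not Hadoop's actual scheduler code; the function name and data shapes are illustrative assumptions. It shows why direct-attached storage lets the scheduler place a task on a node that already holds the data, while a SAN/NAS (where HDFS has no replica host list) makes every placement look remote.

```python
# Illustrative sketch only -- not Hadoop's real scheduler.
# place_task picks a node for a task: data-local if any free node
# holds a replica of the task's input block, otherwise any free node.

def place_task(block_replicas, free_nodes):
    """block_replicas: hostnames holding a replica of the input block.
    free_nodes: hostnames with spare task slots.
    Returns (chosen_node, placement_kind)."""
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"   # no network transfer needed
    return free_nodes[0], "remote"      # data must cross the network

# With direct-attached disks, HDFS reports replica locations:
print(place_task(["node2", "node5", "node7"], ["node1", "node5", "node9"]))
# -> ('node5', 'node-local')

# On shared SAN/NAS storage there is no per-node replica list,
# so placement is always "remote" from the scheduler's point of view:
print(place_task([], ["node1", "node5", "node9"]))
# -> ('node1', 'remote')
```

In the real system, the JobTracker (MapReduce v1) or YARN scheduler performs a similar preference ordering: node-local, then rack-local, then off-rack.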
Still, there are good reasons for virtualizing Hadoop, and many Hadoop clusters run on
public clouds today:
▪ Speed: you can spin up a cluster quickly, with no hardware to order and configure.
▪ Elasticity: you can quickly grow or shrink the cluster to meet demand for services.
▪ Resilience: failure recovery is managed by the virtualization technology.
And there are some disadvantages:
▪ MapReduce and YARN assume complete control of machine resources. This is not true
in a virtualized environment, where resources are shared with other workloads.
▪ Data layout is critical, so on shared storage excessive disk head movement may occur,
and HDFS's normal triple replication is critical for data protection. A good
virtualization strategy must provide equivalent protection. Some do; some don't.
You'll need to weigh the advantages and disadvantages to decide if Virtual Hadoop is
appropriate for your projects.
Tutorial Links
Background reading on virtualizing Hadoop can be found at:
“Deploying Hadoop with Serengeti”
The Virtual Hadoop wiki
“Hadoop Virtualization Extensions on VMware vSphere 5”