Foundation (which operates many of the other Hadoop-related projects), many people have
successfully used it in deployments.
Why virtualize Hadoop at all? Historically, Hadoop clusters have run on commodity servers
(i.e., Intel x86 machines with their own set of disks, running the Linux OS). When scheduling
jobs, Hadoop uses the location of data in HDFS (described here) to run code as close to the
data as possible, preferably on the same node, to minimize the amount of data transferred
across the network. In many virtualized environments, directly attached storage is replaced
by a shared storage device such as a storage area network (SAN) or network-attached storage
(NAS). In these environments, there is no notion of storage locality.
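To make the locality idea concrete, here is a minimal sketch, in Python, of locality-aware task placement. This is not Hadoop's actual scheduler code; the function name and data shapes are illustrative assumptions. It shows why direct-attached storage lets the scheduler place a task on a node that already holds the data, while a SAN/NAS (where HDFS has no replica host list) makes every placement look remote.

```python
# Illustrative sketch only -- not Hadoop's real scheduler.
# place_task picks a node for a task: data-local if any free node
# holds a replica of the task's input block, otherwise any free node.

def place_task(block_replicas, free_nodes):
    """block_replicas: hostnames holding a replica of the input block.
    free_nodes: hostnames with spare task slots.
    Returns (chosen_node, placement_kind)."""
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"   # no network transfer needed
    return free_nodes[0], "remote"      # data must cross the network

# With direct-attached disks, HDFS reports replica locations:
print(place_task(["node2", "node5", "node7"], ["node1", "node5", "node9"]))
# -> ('node5', 'node-local')

# On shared SAN/NAS storage there is no per-node replica list,
# so placement is always "remote" from the scheduler's point of view:
print(place_task([], ["node1", "node5", "node9"]))
# -> ('node1', 'remote')
```

In the real system, the JobTracker (MapReduce v1) or YARN scheduler performs a similar preference ordering: node-local, then rack-local, then off-rack.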
Still, there are good reasons for virtualizing Hadoop, and many Hadoop clusters run on
public clouds today:
▪ Speed: you can spin up a cluster quickly, with no hardware to order and configure.
▪ Elasticity: you can quickly grow or shrink the cluster to meet demand for services.
▪ Resilience: failure recovery is managed by the virtualization technology.
And there are some disadvantages:
▪ MapReduce and YARN assume complete control of machine resources. This is not true
in a virtualized environment, where resources are shared with other workloads.
▪ Data layout is critical, so on shared storage excessive disk head movement may occur,
and HDFS's normal triple replication is critical for data protection. A good
virtualization strategy must provide equivalent protection. Some do; some don't.
You'll need to weigh the advantages and disadvantages to decide if Virtual Hadoop is
appropriate for your projects.
Tutorial Links
Background reading on virtualizing Hadoop can be found at:
“Deploying Hadoop with Serengeti”
The Virtual Hadoop wiki
“Hadoop Virtualization Extensions on VMware vSphere 5”