nodes in the cluster close to the data. Why? Transporting blocks of data across a cluster dimin-
ishes performance. Because blocks of HDFS files are normally stored three times, it's likely
that MapReduce can choose to run your jobs on the datanodes where the data is stored.
In a naive virtual environment, the physical location of the data is not known, and in fact, the
real physical storage may be someplace that is not on any node in the cluster at all.
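The locality preference described above can be sketched as a toy scheduler. This is purely illustrative (the names and data structures are invented for this example, not Hadoop's actual APIs): given the nodes that hold a block's replicas, the scheduler prefers one of them so the task reads its data locally.

```python
# Toy illustration of HDFS-style locality-aware task placement.
# Structures and names are illustrative, not Hadoop's real scheduler.

# Map of block ID -> the datanodes holding its replicas
# (HDFS stores each block on three nodes by default).
block_replicas = {
    "blk_001": {"node1", "node2", "node3"},
    "blk_002": {"node2", "node4", "node5"},
}

def pick_task_node(block_id, available_nodes):
    """Prefer a node that already holds a replica of the block,
    so the task reads the data locally instead of over the network."""
    local = block_replicas[block_id] & available_nodes
    if local:
        return sorted(local)[0]       # data-local: no block transfer needed
    return sorted(available_nodes)[0] # fallback: read the block remotely

print(pick_task_node("blk_001", {"node2", "node9"}))  # node2 (data-local)
print(pick_task_node("blk_002", {"node8", "node9"}))  # node8 (remote read)
```

In a naive virtual environment this preference breaks down: the scheduler's notion of "node" no longer corresponds to where the bytes physically live, so even a "data-local" choice may trigger a network transfer.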
Good background reading on virtualizing Hadoop is available, though admittedly from a VMware perspective.
In this chapter, you'll read about some of the open source software that facilitates cloud com-
puting and virtualization. There are also proprietary solutions, but they're not covered in this
edition of the Field Guide to Hadoop.
Serengeti
License: Apache License, Version 2.0
Activity: Medium
Purpose: Hadoop Virtualization
Official Page:
Hadoop Integration: No Integration
If your organization uses VMware's vSphere as the basis of its virtualization strategy, then
Serengeti provides you with a method of quickly building Hadoop clusters in your environ-
ment. Admittedly, vSphere is a proprietary environment, but the code to run Hadoop in it
is open source. Though Serengeti is not affiliated with the Apache Software