Cloud Computing and Virtualization - Field Guide to Hadoop

Database Reference

In-Depth Information

Chapter 8. Cloud Computing and

Virtualization

Most Hadoop clusters today run on “real iron”—that is, on small, Intel-based computers run-

ning some variant of the Linux operating system with directly attached storage. However,

you might want to try this in a cloud or virtual environment. While virtualization usually

comes with some degree of performance degradation, you may find it minimal for your task

set or that it's a worthwhile trade-off for the benefits of cloud computing; these benefits in-

clude low up-front costs and the ability to scale up (and down sometimes) as your dataset

and analytic needs change.

By cloud computing, we'll follow guidelines established by the National Institute of Standar-

ds and Technology (NIST), whose definition of cloud computing you'll find here . A Hadoop

cluster in the cloud will have:

▪ On-demand self-service

▪ Network access

▪ Resource sharing

▪ Rapid elasticity

▪ Measured resource service

While these resource need not exist virtually, in practice, they usually do.

Virtualization means creating virtual, as opposed to real, computing entities. Frequently, the

virtualized object is an operating system on which software or applications are overlaid, but

storage and networks can also be virtualized. Lest you think that virtualization is a relatively

new computing technology, in 1972 IBM released VM/370, in which the 370 mainframe

could be divided into many small, single-user virtual machines. Currently, Amazon Web Ser-

vices is likely the most well-known cloud-computing facility. For a brief explanation of vir-

tualization, look here on Wikipedia .

The official Hadoop perspective on cloud computing and virtualization is explained on this

Wikipedia page . One guiding principle of Hadoop is that data analytics should be run on

Search WWH ::

Custom Search

Home