to the base Hadoop implementation and currently include Avro, Chukwa,
HBase, Hive, Pig, and ZooKeeper.
The Apache Hadoop project develops open-source software for reliable,
scalable, distributed computing. Hadoop includes these subprojects:
Hadoop Common: The common utilities that support the other Hadoop subprojects
Avro: A data serialization system that provides dynamic integration with scripting languages
Cassandra: A scalable multimaster database with no single point of failure
Chukwa: A data collection system for managing large distributed systems
HBase: A scalable, distributed database that supports structured data storage for large tables
HDFS: A distributed file system that provides high-throughput access to application data
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying
MapReduce: A software framework for distributed processing of large data sets on compute clusters
Mahout: A scalable machine learning and data mining library
Pig: A high-level dataflow language and execution framework for parallel computation
ZooKeeper: A high-performance coordination service for distributed applications
The Hadoop MapReduce architecture is functionally similar to the Google implementation, except that the base programming language for Hadoop is Java rather than C++. The implementation is intended to execute on clusters of commodity processors running Linux, but it can also run on a single system as a learning environment. Hadoop clusters follow the shared-nothing distributed processing paradigm, linking individual systems, each with its own local processor, memory, and disk resources, through high-speed communication switches, typically in rack-mounted configurations. This flexibility allows small clusters to be created for testing and development on desktop systems or any Unix/Linux system that provides a JVM; production clusters, however, typically use homogeneous rack-mounted processors in a data center environment.
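The map-shuffle-reduce dataflow that Hadoop implements at cluster scale can be sketched in a single process. The following Python word count is a hypothetical illustration of the paradigm only, not of Hadoop's Java API; the function names and the two-split input are assumptions made for the example:

```python
from collections import defaultdict

def map_phase(split):
    """Emit (word, 1) pairs for each word in an input split."""
    for line in split:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two input splits standing in for the fixed-size splits a real job would read.
splits = [["the quick brown fox"], ["the lazy dog"]]
pairs = [p for split in splits for p in map_phase(split)]
result = reduce_phase(shuffle(pairs))  # {'the': 2, 'quick': 1, ...}
```

In a real Hadoop job each split is processed by a separate map task on a cluster node, and the shuffle moves intermediate pairs across the network to the reduce tasks; the phase boundaries, however, are the same as in this sketch.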
The Hadoop MapReduce architecture is similar to the Google implementation, creating fixed-size input splits from the input data and assigning the