to the base Hadoop implementation and currently include Avro, Chukwa,
HBase, Hive, Pig, and ZooKeeper.
The Apache Hadoop project develops open-source software for reliable,
scalable, distributed computing. Hadoop includes these subprojects:
Hadoop Common: The common utilities that support the other Hadoop subprojects
Avro: A data serialization system that provides dynamic integration with scripting languages
Cassandra: A scalable multimaster database with no single point of failure
Chukwa: A data collection system for managing large distributed systems
HBase: A scalable, distributed database that supports structured data storage for large tables
HDFS: A distributed file system that provides high-throughput access to application data
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying
MapReduce: A software framework for distributed processing of large data sets on compute clusters
Mahout: A scalable machine learning and data mining library
Pig: A high-level dataflow language and execution framework for parallel computation
ZooKeeper: A high-performance coordination service for distributed applications
The Hadoop MapReduce architecture is functionally similar to the Google implementation, except that the base programming language for Hadoop is Java rather than C++. The implementation is intended to execute on clusters of commodity processors running Linux, but it can also run on a single system as a learning environment. Hadoop clusters follow the shared-nothing distributed processing paradigm, linking individual systems, each with its own local processor, memory, and disk resources, through high-speed communication switches, typically in rack-mounted configurations. This flexibility allows small clusters to be created for testing and development on desktop systems or any Unix/Linux system that provides a JVM; production clusters, however, typically use homogeneous rack-mounted processors in a data center environment.
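The map-shuffle-reduce dataflow that Hadoop implements at cluster scale can be sketched in a single process. The following Python word count is a hypothetical illustration of the paradigm only, not of Hadoop's Java API; the function names and the two-split input are assumptions made for the example:

```python
from collections import defaultdict

def map_phase(split):
    """Emit (word, 1) pairs for each word in an input split."""
    for line in split:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two input splits standing in for the fixed-size splits a real job would read.
splits = [["the quick brown fox"], ["the lazy dog"]]
pairs = [p for split in splits for p in map_phase(split)]
result = reduce_phase(shuffle(pairs))  # {'the': 2, 'quick': 1, ...}
```

In a real Hadoop job each split is processed by a separate map task on a cluster node, and the shuffle moves intermediate pairs across the network to the reduce tasks; the phase boundaries, however, are the same as in this sketch.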
The Hadoop MapReduce architecture is similar to the Google implementation, creating fixed-size input splits from the input data and assigning the