FIGURE 4.5 Core Hadoop components: MapReduce (job scheduling/execution system), HBase (key-value store), and HDFS (Hadoop Distributed File System).
Other Hadoop-related projects include:
Avro™—a data serialization system.
Cassandra™—a scalable multimaster database with no single points of failure.
Chukwa™—a data collection system for managing large distributed systems.
HBase™—a scalable, distributed database that supports structured data storage for large tables.
Hive™—a data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™—a scalable machine learning and data mining library.
Pig™—a high-level data-flow language and execution framework for parallel computation.
ZooKeeper™—a high-performance coordination service for distributed applications.
Hadoop core components
At the heart of the Hadoop framework are the components that form its foundational core. These components are shown in Figure 4.5 and discussed in detail in the following subsections.
HDFS
HDFS is a highly fault-tolerant, scalable, and distributed file system architected to run on commodity
hardware.
The HDFS architecture was designed to solve two problems that early developers of large-scale data processing faced. The first was the ability to break files down across multiple systems, process each piece of a file independently of the others, and finally consolidate all the outputs into a single result set. The second was providing fault tolerance, both at the file-processing level and at the overall system level, in distributed data-processing systems.
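The split-process-consolidate pattern described above can be sketched in a few lines. This is an illustrative simplification, not Hadoop's actual API: a file's contents are broken into independent pieces (as HDFS splits files into blocks), each piece is processed on its own, and the partial outputs are merged into one result set.

```python
from collections import Counter

def split_into_chunks(text, chunk_size):
    """Break the input into independent pieces, as HDFS splits files into blocks."""
    words = text.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

def process_chunk(words):
    """Process one piece independently of the other pieces."""
    return Counter(words)

def consolidate(partial_counts):
    """Merge all partial outputs into a single result set."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

text = "big data needs big storage and big processing"
chunks = split_into_chunks(text, chunk_size=3)
result = consolidate(process_chunk(c) for c in chunks)
print(result["big"])  # each chunk was counted independently, then merged -> 3
```

Because each chunk is processed with no knowledge of the others, the `process_chunk` calls could run on different machines in parallel, which is exactly the property HDFS and MapReduce exploit.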
Some of the assumptions of the HDFS design are:
Redundancy—hardware will be prone to failure and processes can run out of infrastructure resources, but redundancy built into the design can handle these situations.
Scalability—linear scalability at the storage layer is needed to utilize parallel processing at its optimum level; the design targets 100% linear scalability.
Fault tolerance—the automatic ability to recover from failure and complete the processing of the data.
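The redundancy and fault-tolerance assumptions can be illustrated with a toy placement scheme. The names here (`place_block`, `read_block`, the round-robin placement) are hypothetical, not HDFS code; only the replication factor of 3 reflects HDFS's actual default. Each block is stored on several nodes, so a read survives the loss of any single node.

```python
REPLICATION = 3  # HDFS's default replication factor

def place_block(block_id, nodes):
    """Assign a block to REPLICATION distinct nodes (round-robin for simplicity)."""
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(REPLICATION)]

def read_block(block_id, placement, failed):
    """Read from the first replica whose node has not failed."""
    for node in placement[block_id]:
        if node not in failed:
            return f"block-{block_id}@{node}"
    raise IOError(f"all replicas of block {block_id} lost")

nodes = ["node1", "node2", "node3", "node4"]
placement = {b: place_block(b, nodes) for b in range(4)}

# Simulate a node failure: the read automatically falls back to a
# surviving replica, so processing can complete despite the failure.
print(read_block(0, placement, failed={"node1"}))
```

Real HDFS adds rack-aware placement and automatic re-replication of under-replicated blocks, but the core idea is the same: recovery is automatic because no block has a single point of storage.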