splits to map tasks. The local output from the map tasks is copied to reduce
nodes where it is sorted and merged for processing by reduce tasks that pro-
duce the final output.
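To make this flow concrete, the following sketch shows a minimal word-count job written against the Hadoop Java MapReduce API. The class names and the input and output paths are illustrative assumptions, not values prescribed by this text: the map tasks emit (word, 1) pairs from each input split, and the reduce tasks receive the sorted, merged map output and produce the final counts.

// Minimal Hadoop MapReduce word-count sketch; class and path names are illustrative.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: each record of an input split is parsed into (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: receives the sorted, merged map output and emits final counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on map nodes
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}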
The Hadoop execution environment supports additional distributed
data processing capabilities that are designed to run using the Hadoop
MapReduce architecture. Several of these have become official Hadoop
subprojects within the Apache Software Foundation. These include a dis-
tributed file system called HDFS, which is analogous to GFS in the Google
MapReduce implementation. HBase is a distributed, column-oriented database
modeled after BigTable, Google's storage system, and provides similar
random-access read/write capabilities. HBase is not relational and does not
support SQL, but it provides a Java API and a command-line shell for table
management. Hive is a data warehouse system built on top of Hadoop
that provides SQL-like query capabilities for data summarization, ad hoc
queries, and analysis of large data sets. Other Apache-sanctioned projects
for Hadoop include Avro, a data serialization system that provides dynamic
integration with scripting languages; Chukwa, a data collection system for
managing large distributed systems; Pig, a high-level dataflow language
and execution framework for parallel computation; and ZooKeeper, a high-
performance coordination service for distributed applications.
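As a brief illustration of the HBase Java API mentioned above, the sketch below writes one cell to a table and reads it back. The table name, column family, and qualifier are assumptions made for the example, not values taken from this text.

// Illustrative HBase client usage; table and column names are assumed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();     // reads hbase-site.xml
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("web_pages"))) {

      // Random-access write: one cell in the "content" column family.
      Put put = new Put(Bytes.toBytes("row-0001"));
      put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"),
          Bytes.toBytes("<html>...</html>"));
      table.put(put);

      // Random-access read of the same row.
      Get get = new Get(Bytes.toBytes("row-0001"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
      System.out.println(Bytes.toString(value));
    }
  }
}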
17.3.1 Hadoop Distributed File System (HDFS)
HDFS is a highly fault-tolerant, scalable, distributed file system architected
to run on commodity hardware. The HDFS architecture was designed to solve
two problems that early developers of large-scale data processing systems
encountered. The first was how to break a file into pieces spread across
multiple systems, process each piece independently of the others, and finally
consolidate all the outputs into a single result set. The second was how to
achieve fault tolerance both at the file-processing level and at the overall
system level of a distributed data processing system.
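The first of these problems, splitting a file across machines so that its pieces can be processed independently, is visible directly through the HDFS Java client. The sketch below, which assumes a file at the hypothetical path /data/input/logs.txt, asks the file system where each block of that file is stored.

// Sketch: listing where HDFS has placed the blocks of a file (path is assumed).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path("/data/input/logs.txt"));

    // Each block can be processed independently on the nodes that hold a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
  }
}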
The three principal goals of the HDFS architecture are
1. Process extremely large files ranging from multiple gigabytes to
petabytes
2. Streaming data processing to read data at high-throughput rates and
process data on read (see the read sketch after this list)
3. Capability to execute on commodity hardware with no special hard-
ware requirements
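A minimal sketch of the second goal, streaming data off HDFS and processing each record as it is read rather than loading the whole file, is shown below; the file path is again a hypothetical assumption.

// Sketch: streaming a file out of HDFS and counting its lines on read.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // open() returns a stream; records are processed as they arrive, which is
    // what makes files of many gigabytes to petabytes practical to handle.
    try (FSDataInputStream in = fs.open(new Path("/data/input/events.log"));
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      long lines = 0;
      String line;
      while ((line = reader.readLine()) != null) {
        lines++;                       // process each record on read
      }
      System.out.println("lines read: " + lines);
    }
  }
}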
The HDFS design is based on the following assumptions:
Redundancy —Hardware will be prone to failure, and processes can
run out of infrastructure resources, but redundancy built into the
design can handle these situations.