splits to map tasks. The local output from the map tasks is copied to reduce
nodes where it is sorted and merged for processing by reduce tasks that pro-
duce the final output.
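To make this flow concrete, the following sketch shows a minimal word-count job written against the Hadoop Java MapReduce API. The class names and the input and output paths are illustrative assumptions, not values prescribed by this text: the map tasks emit (word, 1) pairs from each input split, and the reduce tasks receive the sorted, merged map output and produce the final counts.

// Minimal Hadoop MapReduce word-count sketch; class and path names are illustrative.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: each record of an input split is parsed into (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: receives the sorted, merged map output and emits final counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on map nodes
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}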
The Hadoop execution environment supports additional distributed
data processing capabilities that are designed to run using the Hadoop
MapReduce architecture. Several of these have become official Hadoop
subprojects within the Apache Software Foundation. These include a dis-
tributed file system called HDFS, which is analogous to GFS in the Google
MapReduce implementation. HBase is a distributed, column-oriented database
modeled after BigTable, Google's storage system, and provides similar
random-access read/write capabilities. HBase is not relational and does not
support SQL, but it provides a Java API and a command-line shell for table
management. Hive is a data warehouse system built on top of Hadoop
that provides SQL-like query capabilities for data summarization, ad hoc
queries, and analysis of large data sets. Other Apache-sanctioned projects
for Hadoop include Avro, a data serialization system that provides dynamic
integration with scripting languages; Chukwa, a data collection system for
managing large distributed systems; Pig, a high-level dataflow language
and execution framework for parallel computation; and ZooKeeper, a high-
performance coordination service for distributed applications.
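As a brief illustration of the HBase Java API mentioned above, the sketch below writes one cell to a table and reads it back. The table name, column family, and qualifier are assumptions made for the example, not values taken from this text.

// Illustrative HBase client usage; table and column names are assumed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();     // reads hbase-site.xml
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("web_pages"))) {

      // Random-access write: one cell in the "content" column family.
      Put put = new Put(Bytes.toBytes("row-0001"));
      put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"),
          Bytes.toBytes("<html>...</html>"));
      table.put(put);

      // Random-access read of the same row.
      Get get = new Get(Bytes.toBytes("row-0001"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
      System.out.println(Bytes.toString(value));
    }
  }
}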
17.3.1 Hadoop Distributed File System (HDFS)
HDFS is a highly fault-tolerant, scalable, distributed file system architected
to run on commodity hardware. The HDFS architecture was designed to solve
two problems that early developers of large-scale data processing systems
encountered. The first was how to break a file into pieces spread across
multiple systems, process each piece independently of the others, and finally
consolidate all the outputs into a single result set. The second was how to
achieve fault tolerance both at the file-processing level and at the overall
system level of a distributed data processing system.
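The first of these problems, splitting a file across machines so that its pieces can be processed independently, is visible directly through the HDFS Java client. The sketch below, which assumes a file at the hypothetical path /data/input/logs.txt, asks the file system where each block of that file is stored.

// Sketch: listing where HDFS has placed the blocks of a file (path is assumed).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path("/data/input/logs.txt"));

    // Each block can be processed independently on the nodes that hold a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
  }
}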
The three principal goals of the HDFS architecture are
1. Process extremely large files ranging from multiple gigabytes to
petabytes
2. Streaming data processing to read data at high-throughput rates and
process data on read (see the read sketch after this list)
3. Capability to execute on commodity hardware with no special hard-
ware requirements
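A minimal sketch of the second goal, streaming data off HDFS and processing each record as it is read rather than loading the whole file, is shown below; the file path is again a hypothetical assumption.

// Sketch: streaming a file out of HDFS and counting its lines on read.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // open() returns a stream; records are processed as they arrive, which is
    // what makes files of many gigabytes to petabytes practical to handle.
    try (FSDataInputStream in = fs.open(new Path("/data/input/events.log"));
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      long lines = 0;
      String line;
      while ((line = reader.readLine()) != null) {
        lines++;                       // process each record on read
      }
      System.out.println("lines read: " + lines);
    }
  }
}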
The HDFS design is based on the following assumptions:
Redundancy —Hardware will be prone to failure, and processes can
run out of infrastructure resources, but redundancy built into the
design can handle these situations.