More formally, Apache Hadoop is an open-source software framework that facilitates
distributed processing of large data sets across clusters of computers by using a simple programming
model. It consists of two primary components: the Hadoop Distributed File System (HDFS), a reliable,
distributed data storage layer; and MapReduce, a parallel and distributed processing system. A Hadoop
cluster can be made up of a single node or thousands of nodes.
In short, Hadoop = MapReduce + HDFS.
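To make the HDFS side of that equation concrete, here is a minimal sketch (not from the text) that uses the standard Hadoop Java FileSystem API to write a file into HDFS and read it back. The namenode address and the file path are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Assumed namenode address; replace with your cluster's fs.defaultFS value.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/demo/hello.txt"); // hypothetical path

        // Write: HDFS splits the file into blocks and replicates them across nodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello from HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back to confirm the round trip.
        byte[] buffer = new byte[(int) fs.getFileStatus(path).getLen()];
        try (FSDataInputStream in = fs.open(path)) {
            in.readFully(buffer);
        }
        System.out.println(new String(buffer, StandardCharsets.UTF_8));
    }
}
```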
MapReduce
The idea behind MapReduce is divide and conquer: as data volumes grow, you add nodes to the
cluster to share the processing load. The programming model is built around two core operations,
map and reduce, and it views every job as a computation over key-value pair datasets. Both the input
and the output must therefore consist solely of key-value pairs. Figure B-2 shows a conceptual
rendition of a MapReduce job: inputs flow through mapping instructions and then reducer
instructions to produce an output that is typically aggregated data.
FIGURE B-2 A diagram of a MapReduce job.
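To make the key-value model concrete, here is a minimal word-count sketch using the standard Hadoop Java MapReduce API (org.apache.hadoop.mapreduce). It is illustrative rather than taken from the text: the mapper turns each (offset, line) input pair into (word, 1) pairs, the reducer sums the counts for each word, and the input and output paths are supplied as command-line arguments.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input record arrives as an (offset, line) pair;
    // emit (word, 1) for every word in the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: all values for the same word arrive together;
    // sum them to produce a (word, total) output pair.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```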
Pig and Hive
Other Hadoop-related projects such as Pig and Hive are built on top of HDFS and the MapReduce
framework and are used to provide a simpler way to manage a cluster than working with the Ma-
pReduce programs directly. With Pig, for example, you can write programs by using JavaScript that
are compiled to MapReduce programs on the cluster. It also provides fluent controls to manage data
flow. Hive provides a table abstraction for data in files stored in a cluster that can be queried by using
SQL-like statements.
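As an illustrative sketch of Hive's SQL-like interface (not from the text), the following Java snippet queries a hypothetical words table through the standard Hive JDBC driver; the HiveServer2 host, port, and table name are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWordCounts {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (requires hive-jdbc on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed HiveServer2 endpoint; 10000 is the conventional default port.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             // Hive maps this SQL-like query onto files stored in the cluster;
             // the "words" table here is hypothetical.
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) AS total FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("total"));
            }
        }
    }
}
```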