More formally, Apache Hadoop is an open-source software framework that facilitates
distributed processing of large data sets across clusters of computers by using a simple programming
model. It consists of two primary components: the Hadoop Distributed File System (HDFS), a reliable,
distributed data storage layer; and MapReduce, a parallel and distributed processing system. A Hadoop
cluster can be made up of a single node or thousands of nodes.
In short, Hadoop = MapReduce + HDFS.
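To make the HDFS side of that equation concrete, here is a minimal sketch (not from the text) that uses the standard Hadoop Java FileSystem API to write a file into HDFS and read it back. The namenode address and the file path are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Assumed namenode address; replace with your cluster's fs.defaultFS value.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/demo/hello.txt"); // hypothetical path

        // Write: HDFS splits the file into blocks and replicates them across nodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello from HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back to confirm the round trip.
        byte[] buffer = new byte[(int) fs.getFileStatus(path).getLen()];
        try (FSDataInputStream in = fs.open(path)) {
            in.readFully(buffer);
        }
        System.out.println(new String(buffer, StandardCharsets.UTF_8));
    }
}
```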
MapReduce
The idea behind MapReduce is divide and conquer: as data volumes grow, you add nodes to the
cluster to share the processing load. The programming model is built around two core operations,
map and reduce, and it views every job as a computation over key-value pair datasets. Both the input
and the output must therefore consist solely of key-value pairs. Figure B-2 shows a conceptual
rendition of a MapReduce job: inputs flow through mapping instructions and then reducer
instructions to produce an output that is typically aggregated data.
FIGURE B-2 A diagram of a MapReduce job.
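To make the key-value model concrete, here is a minimal word-count sketch using the standard Hadoop Java MapReduce API (org.apache.hadoop.mapreduce). It is illustrative rather than taken from the text: the mapper turns each (offset, line) input pair into (word, 1) pairs, the reducer sums the counts for each word, and the input and output paths are supplied as command-line arguments.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input record arrives as an (offset, line) pair;
    // emit (word, 1) for every word in the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: all values for the same word arrive together;
    // sum them to produce a (word, total) output pair.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```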
Pig and Hive
Other Hadoop-related projects such as Pig and Hive are built on top of HDFS and the MapReduce
framework and are used to provide a simpler way to manage a cluster than working with the Ma-
pReduce programs directly. With Pig, for example, you can write programs by using JavaScript that
are compiled to MapReduce programs on the cluster. It also provides fluent controls to manage data
flow. Hive provides a table abstraction for data in files stored in a cluster that can be queried by using
SQL-like statements.
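As an illustrative sketch of Hive's SQL-like interface (not from the text), the following Java snippet queries a hypothetical words table through the standard Hive JDBC driver; the HiveServer2 host, port, and table name are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWordCounts {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (requires hive-jdbc on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed HiveServer2 endpoint; 10000 is the conventional default port.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             // Hive maps this SQL-like query onto files stored in the cluster;
             // the "words" table here is hypothetical.
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) AS total FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("total"));
            }
        }
    }
}
```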