Advanced Analytics—Technology and Tools: MapReduce and Hadoop - Data Science and Big Data Analytics

Summary

This chapter examined the MapReduce paradigm and its application in Big Data
analytics, specifically its implementation in Apache Hadoop. The power of
MapReduce is realized when the Hadoop Distributed File System (HDFS) is used to
store data across a distributed system. The ability to run a MapReduce job on
data stored across a cluster of machines enables the parallel processing of
petabytes or exabytes of data. Furthermore, by adding machines to the cluster,
Hadoop can scale as data volumes grow.
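As a reminder of the paradigm summarized above, the following is a minimal single-process sketch of the map, shuffle, and reduce steps applied to a word count. It is an illustration only and does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are assumptions introduced here. On a real cluster, the map calls would run in parallel on the nodes that hold each block of the input.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values that share a key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence over a list of text records."""
    intermediate = []
    for doc in documents:  # on a cluster, each input split is mapped in parallel
        intermediate.extend(map_phase(doc))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data analytics"]))
```

Because each map call depends only on its own record and each reduce call only on the values for one key, the work can be spread over more machines without changing the logic, which is the scaling property described above.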

This chapter also examined several Apache projects within the Hadoop ecosystem.
By providing higher-level programming languages, Apache Pig and Hive simplify
code development by masking the underlying MapReduce logic needed to perform
common data processing tasks such as filtering, joining datasets, and
restructuring data. Once the data is properly conditioned within the Hadoop
cluster, Apache Mahout can be used to conduct data analyses such as clustering,
classification, and collaborative filtering.
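To make concrete the kind of logic that Pig and Hive hide, the sketch below hand-codes a filter followed by a join as map and reduce steps (a reduce-side join). It is a simplified, single-process illustration, not the code that Pig or Hive actually generate; the record layouts and function names (map_users, map_orders, reduce_join, filter_and_join) are hypothetical.

```python
from collections import defaultdict


def map_users(user):
    """Filter in the mapper: keep only US users, keyed by user_id."""
    user_id, name, country = user
    if country == "US":
        yield user_id, ("user", name)


def map_orders(order):
    """Key every order by the user_id it belongs to."""
    _order_id, user_id, amount = order
    yield user_id, ("order", amount)


def reduce_join(user_id, tagged_values):
    """Pair each user record with each order record that shares the key."""
    names = [value for tag, value in tagged_values if tag == "user"]
    amounts = [value for tag, value in tagged_values if tag == "order"]
    for name in names:
        for amount in amounts:
            yield user_id, name, amount


def filter_and_join(users, orders):
    """Map both inputs, group by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for user in users:
        for key, value in map_users(user):
            groups[key].append(value)
    for order in orders:
        for key, value in map_orders(order):
            groups[key].append(value)
    result = []
    for key in sorted(groups):
        result.extend(reduce_join(key, groups[key]))
    return result


if __name__ == "__main__":
    users = [(1, "ana", "US"), (2, "bo", "DE")]
    orders = [(10, 1, 5.0), (11, 2, 7.0), (12, 1, 3.0)]
    print(filter_and_join(users, orders))
```

In Pig or Hive the same task is typically a few declarative statements; the tagging and grouping shown here are what those higher-level languages spare the developer from writing.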

The strength of MapReduce in Apache Hadoop, and of the ecosystem projects
mentioned so far, lies in batch processing environments. When real-time
processing, including reads and writes, is required, Apache HBase is an option.
HBase uses HDFS to store large volumes of data across the cluster, but it also
maintains recent changes in memory to ensure the real-time availability of the
latest data. Whereas MapReduce in Hadoop, Pig, and Hive are general-purpose
tools that can address a wide range of tasks, HBase is a somewhat more
purpose-specific tool: data is retrieved from and written to HBase in a
well-understood manner.
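The idea of keeping recent changes in memory while older data rests in persistent storage can be sketched with a toy key-value store. This is a conceptual simplification under stated assumptions, not the HBase API: the class ToyStore, its flush_threshold parameter, and the use of in-memory dictionaries to stand in for files on HDFS are all hypothetical.

```python
class ToyStore:
    """Toy store: recent writes live in memory, older ones in flushed segments."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}   # recent changes, held in memory
        self.segments = []   # immutable flushed snapshots, newest last
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        """Record a write in memory and flush once the buffer is full."""
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Move the in-memory buffer to a new immutable segment."""
        if self.memstore:
            self.segments.append(dict(self.memstore))
            self.memstore = {}

    def get(self, row_key):
        """Check memory first, then segments from newest to oldest."""
        if row_key in self.memstore:
            return self.memstore[row_key]
        for segment in reversed(self.segments):
            if row_key in segment:
                return segment[row_key]
        return None


if __name__ == "__main__":
    store = ToyStore(flush_threshold=2)
    store.put("row-a", 1)
    store.put("row-b", 2)   # buffer is full, so it is flushed to a segment
    store.put("row-a", 3)   # newer value stays in memory
    print(store.get("row-a"), store.get("row-b"))
```

Reads consult the in-memory buffer before any flushed data, so the latest write to a row is visible immediately even though the bulk of the data lives in storage.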

HBase is one example of the NoSQL (Not only Structured Query Language) data
stores that are being developed to address specific Big Data use cases.
Maintaining and traversing social network graphs is one example of a task for
which a relational database is not the best choice of data store. However,
relational databases and SQL remain powerful and common tools and will be
examined in more detail in Chapter 11.
