Advanced Analytics—Technology and Tools: MapReduce and Hadoop - Data Science and Big Data Analytics

Key Concepts

Hadoop

Hadoop Ecosystem

MapReduce

NoSQL

Chapter 4, “Advanced Analytical Theory and Methods: Clustering,” through Chapter 9, “Advanced Analytical Theory and Methods: Text Analysis,” covered several useful analytical methods to classify, predict, and examine relationships within the data. This chapter and Chapter 11, “Advanced Analytics—Technology and Tools: In-Database Analytics,” address several aspects of collecting, storing, and processing unstructured and structured data, respectively. This chapter presents some key technologies and tools related to the Apache Hadoop software library, “a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models” [1].

This chapter focuses on how Hadoop stores data in a distributed system and how Hadoop implements a simple programming paradigm known as MapReduce.

Although this chapter makes some Java-specific references, the only intended prerequisite knowledge is a basic understanding of programming. Furthermore, the Java-specific details of writing a MapReduce program for Apache Hadoop are beyond the scope of this text. This omission may appear troublesome, but tools in the Hadoop ecosystem, such as Apache Pig and Apache Hive, can often eliminate the need to explicitly code a MapReduce program. Along with other Hadoop-related tools, Pig and Hive are covered in a portion of this chapter dealing with the Hadoop ecosystem.

To illustrate the power of Hadoop in handling unstructured data, the following discussion provides several Hadoop use cases.
