Key Concepts
Hadoop
Hadoop Ecosystem
MapReduce
NoSQL
Chapter 4, “Advanced Analytical Theory and Methods: Clustering,” through Chapter
9, “Advanced Analytical Theory and Methods: Text Analysis,” covered several useful
analytical methods to classify, predict, and examine relationships within the data.
This chapter and Chapter 11, “Advanced Analytics—Technology and Tools:
In-Database Analytics,” address several aspects of collecting, storing, and
processing unstructured and structured data, respectively. This chapter presents
some key technologies and tools related to the Apache Hadoop software library,
“a framework that allows for the distributed processing of large datasets across
clusters of computers using simple programming models” [1].
This chapter focuses on how Hadoop stores data in a distributed system and how
Hadoop implements a simple programming paradigm known as MapReduce.
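As a rough illustration of the MapReduce paradigm named above, the following is a minimal single-process sketch in plain Python, not Hadoop code. It counts words by mapping each input record to (key, value) pairs, grouping the pairs by key, and reducing each group to one result. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example; in a real Hadoop job the framework performs the grouping step and distributes the work across the cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of strings."""
    intermediate = []
    for document in documents:
        intermediate.extend(map_phase(document))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

Because each call to map_phase depends only on its own record, and each call to reduce_phase depends only on its own key, both steps could in principle run in parallel on different machines, which is the property a distributed framework exploits.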
Although this chapter makes some Java-specific references, the only intended
prerequisite knowledge is a basic understanding of programming. Furthermore,
the Java-specific details of writing a MapReduce program for Apache Hadoop are
beyond the scope of this text. This omission may appear troublesome, but tools in
the Hadoop ecosystem, such as Apache Pig and Apache Hive, can often eliminate
the need to explicitly code a MapReduce program. Pig, Hive, and other
Hadoop-related tools are covered later in this chapter, in the discussion of the
Hadoop ecosystem.
To illustrate the power of Hadoop in handling unstructured data, the following
discussion provides several Hadoop use cases.