What's in This Book?
The book is divided into five main parts: Parts I to III are about core Hadoop, Part IV covers related projects in the Hadoop ecosystem, and Part V contains Hadoop case studies. You can read the book from cover to cover, but there are alternative pathways through it that allow you to skip chapters that aren't needed to read later ones. See Figure 1-1.
Part I is made up of five chapters that cover the fundamental components in Hadoop and should be read before tackling later chapters. Chapter 1 (this chapter) is a high-level introduction to Hadoop. Chapter 2 provides an introduction to MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter 4 discusses YARN, Hadoop's cluster resource management system. Chapter 5 covers the I/O building blocks in Hadoop: data integrity, compression, serialization, and file-based data structures.
Part II has four chapters that cover MapReduce in depth. They provide useful understanding for later chapters (such as the data processing chapters in Part IV), but could be skipped on a first reading. Chapter 6 goes through the practical steps needed to develop a MapReduce application. Chapter 7 looks at how MapReduce is implemented in Hadoop, from the point of view of a user. Chapter 8 is about the MapReduce programming model and the various data formats that MapReduce can work with. Chapter 9 is on advanced MapReduce topics, including sorting and joining data.
Part III concerns the administration of Hadoop: Chapters 10 and 11 describe how to set up
and maintain a Hadoop cluster running HDFS and MapReduce on YARN.
Part IV of the book is dedicated to projects that build on Hadoop or are closely related to it. Each chapter covers one project and is largely independent of the other chapters in this part, so they can be read in any order.
The first two chapters in this part are about data formats. Chapter 12 looks at Avro, a cross-language data serialization library for Hadoop, and Chapter 13 covers Parquet, an efficient columnar storage format for nested data.
The next two chapters look at data ingestion, or how to get your data into Hadoop. Chapter 14 is about Flume, for high-volume ingestion of streaming data. Chapter 15 is about Sqoop, for efficient bulk transfer of data between structured data stores (like relational databases) and HDFS.
The common theme of the next four chapters is data processing, and in particular using higher-level abstractions than MapReduce. Pig (Chapter 16) is a data flow language for exploring very large datasets. Hive (Chapter 17) is a data warehouse for managing data stored in HDFS and provides a query language based on SQL. Crunch (Chapter 18) is a high-level Java API for writing data processing pipelines that can run on MapReduce or Spark. Spark (Chapter 19) is a cluster computing framework for large-scale data processing.