What's in This Book?
The book is divided into five main parts: Parts I to III are about core Hadoop, Part IV covers related projects in the Hadoop ecosystem, and Part V contains Hadoop case studies. You can read the book from cover to cover, but there are alternative pathways through it that allow you to skip chapters that aren't needed to read later ones. See Figure 1-1.
Part I is made up of five chapters that cover the fundamental components in Hadoop and should be read before tackling later chapters. Chapter 1 (this chapter) is a high-level introduction to Hadoop. Chapter 2 provides an introduction to MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter 4 discusses YARN, Hadoop's cluster resource management system. Chapter 5 covers the I/O building blocks in Hadoop: data integrity, compression, serialization, and file-based data structures.
Part II has four chapters that cover MapReduce in depth. They provide useful understanding for later chapters (such as the data processing chapters in Part IV), but could be skipped on a first reading. Chapter 6 goes through the practical steps needed to develop a MapReduce application. Chapter 7 looks at how MapReduce is implemented in Hadoop, from the point of view of a user. Chapter 8 is about the MapReduce programming model and the various data formats that MapReduce can work with. Chapter 9 is on advanced MapReduce topics, including sorting and joining data.
Part III concerns the administration of Hadoop: Chapters 10 and 11 describe how to set up
and maintain a Hadoop cluster running HDFS and MapReduce on YARN.
Part IV of the book is dedicated to projects that build on Hadoop or are closely related to it. Each chapter covers one project and is largely independent of the other chapters in this part, so they can be read in any order.
The first two chapters in this part are about data formats. Chapter 12 looks at Avro, a cross-language data serialization library for Hadoop, and Chapter 13 covers Parquet, an efficient columnar storage format for nested data.
The next two chapters look at data ingestion, or how to get your data into Hadoop. Chapter 14 is about Flume, for high-volume ingestion of streaming data. Chapter 15 is about Sqoop, for efficient bulk transfer of data between structured data stores (like relational databases) and HDFS.
The common theme of the next four chapters is data processing, and in particular using higher-level abstractions than MapReduce. Pig (Chapter 16) is a data flow language for exploring very large datasets. Hive (Chapter 17) is a data warehouse for managing data stored in HDFS and provides a query language based on SQL. Crunch (Chapter 18) is a high-level Java API for writing data processing pipelines that can run on MapReduce or Spark. Spark (Chapter 19) is a cluster computing framework for large-scale data processing.