Storing and Configuring Data with Hadoop, YARN, and ZooKeeper - Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset


Chapter 2

Storing and Configuring Data with Hadoop, YARN, and ZooKeeper

This chapter introduces Hadoop versions V1 and V2, laying the groundwork for the chapters that follow. Specifically, you will first source the V1 software, install it, and then configure it. You will test your installation by running a simple word-count MapReduce task. As a comparison, you will then do the same for V2, as well as install a ZooKeeper quorum. You will then learn how to access ZooKeeper via its commands and client to examine the data that it stores. Lastly, you will learn about the Hadoop command set in terms of shell, user, and administration commands. The Hadoop installation that you create here will be used for storage and processing in subsequent chapters, when you will work with Apache tools like Nutch and Pig.

An Overview of Hadoop

Apache Hadoop is available as three download types via the hadoop.apache.org website. The releases are named as follows:

• Hadoop-1.2.1

• Hadoop-0.23.10

• Hadoop-2.3.0

The first release relates to Hadoop V1, while the other two relate to Hadoop V2. There are two different release types for V2 because the version that is numbered 0.xx is missing extra components such as NN HA. (NN is “name node” and HA is “high availability.”) Because the two versions have different architectures and are installed differently, I examine Hadoop V1 first and then Hadoop V2 (YARN). In the next section, I will give an overview of each version and then move on to the interesting stuff, such as how to source and install both.
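As a quick illustration of the naming scheme just described, the mapping from release name to Hadoop version can be sketched in a few lines of Python. This is only a minimal sketch: the function name hadoop_generation and the prefix table are hypothetical helpers invented for this example, not part of any Hadoop tool, and the table covers only the three release lines listed above.

```python
# Release-line prefixes named in the text, mapped to the Hadoop version
# they belong to. Hypothetical table for illustration only.
RELEASE_LINES = {
    "1.": "V1",
    "0.23.": "V2",  # V2 architecture, but without NN HA
    "2.": "V2",
}


def hadoop_generation(release: str) -> str:
    """Return 'V1' or 'V2' for a release name such as 'Hadoop-1.2.1'."""
    # Drop the leading 'Hadoop-' part, if present, to get the bare version.
    version = release.split("-", 1)[-1]
    for prefix, generation in RELEASE_LINES.items():
        if version.startswith(prefix):
            return generation
    # Anything outside the three release lines above is not covered here.
    raise ValueError(f"unrecognised release: {release}")


if __name__ == "__main__":
    for name in ("Hadoop-1.2.1", "Hadoop-0.23.10", "Hadoop-2.3.0"):
        print(name, "->", hadoop_generation(name))
```

The lookup deliberately rejects any release name it does not recognise rather than guessing, because other 0.xx release lines do not necessarily follow the same pattern.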

Because I have only a single small cluster available for the development of this topic, I install the different versions of Hadoop and its tools on the same cluster nodes. If I carry out any action for the sake of demonstration that would be dangerous in a production system, I will flag it. This is important because, in a production system, when you are upgrading, you want to be sure that you retain all of your data. However, for demonstration purposes, I will be upgrading and downgrading periodically.

So, in general terms, what is Hadoop? Here are some of its characteristics:

• It is an open-source system developed by Apache in Java.

• It is designed to handle very large data sets.

• It is designed to scale to very large clusters.

• It is designed to run on commodity hardware.
