Database Reference
In-Depth Information
Chapter 2
Storing and Configuring Data with
Hadoop, YARN, and ZooKeeper
This chapter introduces Hadoop versions V1 and V2, laying the groundwork for the chapters that follow. Specifically,
you first will source the V1 software, install it, and then configure it. You will test your installation by running a simple
word-count Map Reduce task. As a comparison, you will then do the same for V2, as well as install a ZooKeeper
quorum. You will then learn how to access ZooKeeper via its commands and client to examine the data that it stores.
Lastly, you will learn about the Hadoop command set in terms of shell, user, and administration commands. The
Hadoop installation that you create here will be used for storage and processing in subsequent chapters, when you
will work with Apache tools like Nutch and Pig.
An Overview of Hadoop
Apache Hadoop is available as three download types via the hadoop.apache.org website. The releases are named as
follows:
Hadoop-1.2.1
Hadoop-0.23.10
Hadoop-2.3.0
The first release relates to Hadoop V1, while the second two relate to Hadoop V2. There are two different release
types for V2 because the version that is numbered 0.xx is missing extra components like NN and HA. (NN is “name
node” and HA is “high availability.”) Because they have different architectures and are installed differently, I first
examine both Hadoop V1 and then Hadoop V2 (YARN). In the next section, I will give an overview of each version and
then move on to the interesting stuff, such as how to source and install both.
Because I have only a single small cluster available for the development of this topic, I install the different
versions of Hadoop and its tools on the same cluster nodes. If any action is carried out for the sake of demonstration,
which would otherwise be dangerous from a production point of view, I will flag it. This is important because, in
a production system, when you are upgrading, you want to be sure that you retain all of your data. However, for
demonstration purposes, I will be upgrading and downgrading periodically.
So, in general terms, what is Hadoop? Here are some of its characteristics:
It is an open-source system developed by Apache in Java.
It is designed to handle very large data sets.
It is designed to scale to very large clusters.
It is designed to run on commodity hardware.
 
Search WWH ::




Custom Search