Chapter 9
Analytics with Hadoop
Analytics is the process of finding significance in data so that it can support decision making. Decision
makers turning to Hadoop-based data for their answers will find numerous analytics options. For example, Hadoop-based
databases like Apache Hive and Cloudera Impala offer SQL-like interfaces to HDFS-based data. For in-memory
data processing, Apache Spark is available, offering processing rates an order of magnitude faster than Hadoop
MapReduce. For those who have experience with relational databases, these SQL-like languages can provide a simple
path into analytics on Hadoop.
In this chapter, I will explain the building blocks of each Hadoop application's SQL-like language. The
transformations your data requires will determine how you construct your more complex SQL. Just as in
Chapter 4, when I introduced Pig user-defined functions (UDFs) to extend the functionality of Pig, in this chapter I
will create and use a Hive UDF in an example written in HiveQL. I begin with coverage of Cloudera Impala, move on to
Apache Hive, and close the chapter with a discussion of Apache Spark.
Cloudera Impala
Released under an Apache license as an open-source system, Cloudera Impala is a massively parallel
processing (MPP) SQL query engine for Apache Hadoop. Forming part of Cloudera's data hub concept, Impala is
the company's Hadoop-based database offering. Chapter 8 introduced the Cloudera cluster manager, which offers
monitoring, security, and a well-defined upgrade process. Impala integrates with this architecture and, because it
uses HDFS, offers a low-cost, easily expandable, reliable, and robust storage system.
By way of example, I install Impala from the Cloudera CDH 4 stack, then demonstrate how to use it via the Hue
application, as well as how to put Impala's shell tool and query language to work.
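To give a flavor of what is coming, the sketch below shows how the Impala shell might be invoked against the impalad daemon on this cluster; the table name in the sample query is hypothetical, and the host/port are assumptions based on the cluster built in earlier chapters and Impala's default port of 21000:

```shell
# Connect impala-shell to the impalad instance on hc1r1m1 (default port 21000)
impala-shell -i hc1r1m1:21000

# Once connected, familiar SQL-like statements can be issued, for example:
#   SHOW TABLES;
#   SELECT COUNT(*) FROM sales_data;   -- sales_data is a hypothetical table
```

Because Impala's query language is SQL-like, anyone comfortable with a relational database shell should find this workflow familiar.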
Installation of Impala
The chapter builds on work carried out in previous chapters on Cloudera's CDH 4 Hadoop stack. So, to follow this
installation, you will need to have completed the installation of the CDH4 Hadoop stack in Chapter 2 and of the Hue
application in Chapter 7. I install Impala manually on server hc1r1m1 and access it via the Hue tool that is already
installed on the server hc1nn. I choose the server hc1r1m1 because I have a limited number of servers and I want to
spread the processing load through my cluster.
The first step is to install some components required by Impala. As the Linux root user on the server hc1r1m1,
issue the following yum command. (These components may already be installed on your servers; if so, they
will not be reinstalled. This step simply ensures that they are available now):
[root@hc1r1m1 ~]# yum install python-devel openssl-devel python-pip
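With the prerequisites in place, the Impala packages themselves can be installed from the Cloudera repository. The package names below follow the CDH 4 naming convention and are an assumption here; check your configured Cloudera repository for the exact names:

```shell
# As root on hc1r1m1, install the Impala daemon, state store, and shell.
# Package names assume the CDH 4 Cloudera repository is already configured.
yum install impala impala-server impala-state-store impala-shell
```

Installing the server and state-store packages on hc1r1m1 keeps the query-processing load off hc1nn, in line with the goal of spreading work across the cluster.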
 