Getting Started with Impala - Learning Cloudera Impala

Chapter 1. Getting Started with Impala

This chapter covers Impala, its core components, and its inner workings in detail. We will cover the Impala architecture, including the Impala daemon, the statestore, and the execution model, and how they interact with each other and with other components. Impala metadata and the metastore are also discussed here, to explain how Impala maintains its information. Finally, we will study various ways to interface with Impala.

The objective of this chapter is to provide enough information for you to kick-start Impala on a single-node experimental cluster or a multi-node production cluster. This chapter covers the Impala essentials within the following broad categories:

• System requirements
• Installation
• Configuration
• Upgrade
• Security
• Impala architecture and execution

Impala is for a new breed of data wranglers who want to process data at lightning-fast speed using traditional SQL knowledge. It gives data analysts and data scientists a way to access data stored on Hadoop at lightning speed, either directly with SQL or through other Business Intelligence tools. Impala works on data in the Hadoop storage layer, HDFS, so there is no need to migrate data from Hadoop to any other middleware, specialized system, or data warehouse. Impala provides data wranglers with a Massively Parallel Processing (MPP) query engine, which runs natively on Hadoop.
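To give a feel for what a massively parallel query engine does, here is a toy sketch in Python. It is not Impala code and uses no Impala API; the node names, the sample rows, and the functions `local_aggregate` and `run_query` are all hypothetical. It only illustrates the general scatter-and-merge idea: each worker aggregates the rows it holds, and a coordinator combines the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: rows of (country, amount) held by three
# imaginary nodes. This stands in for data blocks spread across a cluster.
PARTITIONS = {
    "node1": [("us", 10), ("de", 5)],
    "node2": [("us", 7), ("fr", 3)],
    "node3": [("de", 2), ("fr", 8), ("us", 1)],
}


def local_aggregate(rows):
    """Compute a partial SUM(amount) GROUP BY country over one node's rows."""
    partial = Counter()
    for country, amount in rows:
        partial[country] += amount
    return partial


def run_query(partitions):
    """Run the aggregation on every partition in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(local_aggregate, partitions.values()))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the counts together
    return dict(total)


if __name__ == "__main__":
    print(sorted(run_query(PARTITIONS).items()))
```

In this sketch only the small per-node summaries reach the coordinator, never the raw rows. That is the property an MPP engine relies on, although a real engine plans and distributes work in far more sophisticated ways.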

Native on Hadoop means that the engine runs on Hadoop and uses the Hadoop core component, HDFS, along with additional components such as Hive and HBase. To process data, Impala has its own execution component, which runs on each DataNode where the data is stored in blocks. A number of third-party applications can directly process data stored on Hadoop through Impala. The biggest advantage of Impala is that no data transformation or data movement is required for data stored on Hadoop. No data movement means that all the processing happens where the data resides in the cluster. In other distributed systems, data is transferred over the network before it is processed; however, with Impala the processing happens at
