Advanced Impala Concepts - Learning Cloudera Impala

Chapter 7. Advanced Impala Concepts

In Chapter 6, Troubleshooting Impala, we discussed various concepts about Impala,
which should give you enough information to take charge of Impala projects and
manage them successfully. In this chapter, we are going to learn more about Impala;
this information is more advanced in nature and is meant to help you excel in
data-processing projects using Impala. We describe how Impala works side by side
with MapReduce in the same cluster without using it. We also explain why Impala has
an edge over Hive even though Hive is a key component on which Impala depends.
Finally, we cover some details on using HBase with Impala and on processing various
Big Data input file formats on Hadoop with Impala.

Impala and MapReduce

The very first thing to note is that Impala neither replaces MapReduce nor uses
MapReduce as a processing engine. Impala processes data much faster than MapReduce
and is considered an alternative data-processing framework on Hadoop. Impala
processes data stored at the Hadoop data storage layer using its open source
in-memory processing framework, which does not carry the overhead that MapReduce
does. Impala bypasses MapReduce to get native access to data in HDFS, using a
distributed query engine designed specifically for fast data processing. As each
Impala daemon processes data locally on its DataNode, processing is fast because
little or no data has to cross the network. MapReduce is a capable distributed
data-processing framework that processes data directly on the DataNodes of a
clustered environment; however, executing SQL statements through the MapReduce
framework exhibits performance inefficiencies, mainly due to disk access. Impala
overcomes this inefficiency by processing data in memory. Impala runs side by side
with MapReduce by using the same Hadoop core components and hardware
infrastructure. To restate the earlier point, Impala is faster because the data is
processed in memory; therefore, the memory requirement for Hadoop clusters with
Impala installed is comparatively higher.
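
The contrast between disk access and in-memory processing can be illustrated with a
toy sketch. The following Python is neither Impala nor MapReduce code; it is a
minimal illustration, assuming a hypothetical dataset of (region, amount) rows, of
the same per-key sum computed once with an intermediate file between two stages and
once in a single in-memory pass. The function names and the sample rows are made up
for this example.

```python
import json
import os
import tempfile
from collections import defaultdict

# Hypothetical sample rows: (region, amount). Not taken from the book.
ROWS = [("east", 10), ("west", 5), ("east", 7), ("west", 1), ("north", 4)]


def staged_on_disk(rows, workdir):
    """Two stages with an intermediate file in between (MapReduce-like in spirit)."""
    intermediate = os.path.join(workdir, "map_output.jsonl")
    # First stage: write every (key, value) pair to an intermediate file.
    with open(intermediate, "w") as out:
        for region, amount in rows:
            out.write(json.dumps([region, amount]) + "\n")
    # Second stage: read the intermediate file back and sum per key.
    totals = defaultdict(int)
    with open(intermediate) as src:
        for line in src:
            region, amount = json.loads(line)
            totals[region] += amount
    return dict(totals)


def pipelined_in_memory(rows):
    """Single pass: rows flow straight into the aggregation, no intermediate file."""
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


with tempfile.TemporaryDirectory() as workdir:
    disk_result = staged_on_disk(ROWS, workdir)
memory_result = pipelined_in_memory(ROWS)
print(disk_result == memory_result)
```

Both paths return the same totals. The staged version only adds an extra write and
read of the intermediate data, which is the kind of cost the text attributes to
running SQL statements through MapReduce, while the single-pass version keeps the
running totals in memory, which is why memory becomes the resource to budget for.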
