Chapter 7. Advanced Impala Concepts
In Chapter 6, Troubleshooting Impala, we discussed the concepts you need to take
charge of Impala projects and manage them successfully. In this chapter, we go
further into Impala with more advanced material that will help you excel in
data-processing projects that use it. We describe how Impala runs side by side with
MapReduce in the same cluster without using it. We also explain why Impala has an
edge over Hive, even though Impala depends on Hive components, notably the Hive
metastore. Finally, we cover some details on using HBase with Impala and on
processing various Big Data input file formats on Hadoop with Impala.
Impala and MapReduce
The first thing to note is that Impala neither replaces MapReduce nor uses MapReduce
as a processing engine. Impala is an alternative data-processing framework on Hadoop
and typically answers SQL queries much faster than MapReduce does. Impala processes
data stored in the Hadoop storage layer using its own open source in-memory
processing framework, which avoids the overhead that MapReduce incurs. Impala
bypasses MapReduce and accesses data in HDFS natively through a distributed query
engine designed specifically for fast data processing. Because each Impala daemon
processes data locally on its DataNode, processing is fast, with little or no
network latency. MapReduce remains a capable distributed data-processing framework
that processes data directly on the DataNodes of a cluster; however, executing SQL
statements through MapReduce is inefficient, mainly because of disk access. Impala
overcomes this inefficiency by processing data in memory. Impala runs side by side
with MapReduce, using the same Hadoop core components and hardware infrastructure.
Because Impala processes data in memory, Hadoop clusters with Impala installed need
comparatively more memory.
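
To make the disk-versus-memory point concrete, here is a toy sketch in Python. It is not Impala or MapReduce code, and every name in it (ROWS, filter_rows, sum_by_region, and the two run_ functions) is hypothetical. It only illustrates the general idea: a two-stage computation can hand its intermediate result to the next stage through a file on disk, as a batch pipeline would, or pass it along directly in memory. Both paths return the same answer; the second skips the write and the read.

```python
import json
import os
import tempfile
from collections import defaultdict

# Hypothetical sample rows: (region, sales amount).
ROWS = [("east", 10), ("west", 5), ("east", 7), ("west", 3), ("north", 1)]


def filter_rows(rows, min_amount):
    """Stage 1: keep only rows whose amount is at least min_amount."""
    return [(region, amount) for region, amount in rows if amount >= min_amount]


def sum_by_region(rows):
    """Stage 2: total the amounts per region."""
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


def run_with_disk_handoff(rows, min_amount):
    """Batch style: stage 1 writes its output to a file, stage 2 reads it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage1.json")
        with open(path, "w") as f:
            json.dump(filter_rows(rows, min_amount), f)
        with open(path) as f:
            # JSON stores tuples as lists, so convert them back.
            intermediate = [tuple(row) for row in json.load(f)]
    return sum_by_region(intermediate)


def run_in_memory(rows, min_amount):
    """In-memory style: stage 1 hands its output straight to stage 2."""
    return sum_by_region(filter_rows(rows, min_amount))
```

Calling run_with_disk_handoff(ROWS, 4) and run_in_memory(ROWS, 4) yields the same totals. On a real cluster the intermediate data is far larger, which is why the in-memory path trades higher memory requirements for speed.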