Advanced Impala Concepts - Learning Cloudera Impala

Why Impala is faster than Hive in query processing

We have mentioned many times that Impala is a very fast distributed data-processing framework, so you might want to know how Impala achieves such speed, or what lies behind Impala that makes it so fast. I would answer this question with the following key points:

• While processing SQL-like queries, Impala does not write intermediate results to disk; instead, the full SQL processing is done in memory, which avoids disk I/O between query stages.

• With Impala, a query starts executing almost instantly. A MapReduce job, by contrast, may take significant time to start up before it begins processing a larger SQL query, and this startup cost adds to the total processing time.

• The Impala query planner uses smart algorithms to execute queries in multiple stages on parallel nodes and return results faster, avoiding sorting and shuffle steps, which are unnecessary in most cases.

• Impala has information about each data block in HDFS, so when processing a query, it uses this knowledge to distribute the work more evenly across all DataNodes.

• Another key reason for fast performance is that Impala first generates low-level native code for each query at run time (using LLVM). This native code executes faster than code run through a general-purpose framework: because Impala queries run natively in memory, an intermediate framework would add delay to the execution through its own overhead.
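The block-awareness point above can be illustrated with a small sketch. This is not Impala's actual scheduler; it is a simplified greedy assignment, written under the assumption that the planner knows which nodes hold a replica of each block. The function name assign_blocks, the block ids, and the node names are all hypothetical.

```python
from collections import defaultdict


def assign_blocks(block_replicas):
    """Assign each block to the least-loaded node that holds a local replica.

    block_replicas maps a block id to the list of node names storing a
    replica of that block (a hypothetical input format for illustration).
    Returns a dict mapping node name -> list of block ids to scan locally.
    """
    assignment = defaultdict(list)
    # Place the most constrained blocks (fewest replicas) first.
    ordered = sorted(block_replicas.items(), key=lambda kv: (len(kv[1]), kv[0]))
    for block, nodes in ordered:
        # Choose the replica holder with the fewest blocks so far;
        # ties are broken by node name so the result is deterministic.
        target = min(nodes, key=lambda n: (len(assignment[n]), n))
        assignment[target].append(block)
    return dict(assignment)


# Hypothetical layout: six blocks, three DataNodes, replication factor 2.
layout = {
    "blk_1": ["node_a", "node_b"],
    "blk_2": ["node_a", "node_c"],
    "blk_3": ["node_b", "node_c"],
    "blk_4": ["node_a", "node_b"],
    "blk_5": ["node_b", "node_c"],
    "blk_6": ["node_a", "node_c"],
}
plan = assign_blocks(layout)
for node in sorted(plan):
    print(node, plan[node])
```

Every block is read on a node that already stores it, so no block data crosses the network for the scan, and the per-node block counts stay close to each other. A real planner would also weigh block sizes and current node load, which this sketch ignores.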
