Advanced Impala Concepts - Learning Cloudera Impala

Impala processing strategy

Now let's review how Impala processes a query once it has been submitted. The following points summarize its strategy:

• When a query is submitted, Impala needs two kinds of metadata to start query processing:

  • Catalog information (table definitions) from the Hive metastore

  • File metadata (block locations) from the HDFS NameNode

• It is strongly recommended to run the Impala daemon on all DataNodes, so that Impala can run distributed queries directly against the data where it is stored. If the Impala daemon is not running on every DataNode, Impala still plans the query to run as efficiently and as quickly as it can.

• At the time of writing, Impala supports only in-memory hash aggregations.

• For a JOIN operation, all of the tables referenced in the JOIN must fit in the aggregate memory of the host or hosts where Impala is running.

• When a JOIN operation is submitted, Impala uses either a broadcast join or a partitioned join, depending on the query planner's decision, and follows the table order given in the SELECT statement.

• Impala processes all queries in memory, so the memory available on each node is a limiting factor. You must have enough memory to hold the resulting dataset, which can grow multifold during complex JOIN operations.

• If a query starts processing data and the resulting dataset cannot fit in the available memory, the query fails.
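
The difference between the two join strategies, and the way an in-memory hash table can run out of room, may be easier to see in a small simulation. The following Python sketch is only a toy model under stated assumptions and is not how Impala is implemented: each "node" is a loop iteration, memory is approximated by a row count (`budget_rows`), and every function and variable name here is hypothetical.

```python
from collections import defaultdict


class MemoryLimitExceeded(Exception):
    """Raised when a simulated node's row budget is exhausted."""


def build_hash_table(rows, key, budget_rows):
    """Build an in-memory hash table on `key`; fail once the budget is exceeded."""
    table = defaultdict(list)
    for count, row in enumerate(rows, start=1):
        if count > budget_rows:
            raise MemoryLimitExceeded(
                f"{count} rows exceed the budget of {budget_rows}"
            )
        table[row[key]].append(row)
    return table


def probe(table, rows, key):
    """Probe the hash table with `rows` and return the merged matches."""
    return [{**match, **row} for row in rows for match in table.get(row[key], [])]


def broadcast_join(big_parts, small, key, budget_rows):
    """Each node gets a full copy of `small` and joins it with its local part of `big`."""
    out = []
    for part in big_parts:  # one iteration per simulated node
        table = build_hash_table(small, key, budget_rows)
        out.extend(probe(table, part, key))
    return out


def partitioned_join(left, right, key, nodes, budget_rows):
    """Both inputs are hash-partitioned on `key`; each node builds on its slice only."""
    left_parts = [[] for _ in range(nodes)]
    right_parts = [[] for _ in range(nodes)]
    for row in left:
        left_parts[hash(row[key]) % nodes].append(row)
    for row in right:
        right_parts[hash(row[key]) % nodes].append(row)
    out = []
    for lp, rp in zip(left_parts, right_parts):
        table = build_hash_table(rp, key, budget_rows)
        out.extend(probe(table, lp, key))
    return out


# Hypothetical sample data: 8 orders referencing 4 customers.
orders = [{"cust_id": i % 4, "order_id": i} for i in range(8)]
customers = [{"cust_id": i, "name": f"c{i}"} for i in range(4)]

joined = partitioned_join(orders, customers, "cust_id", nodes=2, budget_rows=2)
print(len(joined), "rows joined")
```

In this model the broadcast join needs room for the whole smaller table on every node, while the partitioned join needs room for only one slice per node at the cost of redistributing both inputs. With a budget of two rows per node, the partitioned join above completes, whereas a broadcast of the four-row `customers` table would raise `MemoryLimitExceeded`, mirroring the query failure described in the last point.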
