Impala processing strategy
Now let's review how Impala processes a query once it has been submitted:
• When a query is submitted, Impala needs two kinds of metadata to start
processing it:
  • Catalog information from the Hive metastore
  • File metadata from the HDFS NameNode
• It is strongly recommended to run the Impala daemon on all DataNodes, so
that Impala can run distributed queries directly against locally stored
data. If the daemon is not running on every DataNode, Impala still plans and
runs the query as efficiently as it can, although some data must then be read
over the network.
• At the time of writing, Impala supports only in-memory hash aggregations.
• For a JOIN operation, all of the tables referenced in the join must fit in
the aggregate memory of the host or hosts where Impala is running.
• When a JOIN operation is submitted, Impala uses either a broadcast join or a
partitioned join, depending on the query planner's decision, and follows the
table order given in the SELECT statement.
• Impala processes all queries in memory, so the memory available on each node
is a limiting factor. You must have enough memory to hold the resultant
dataset, which can grow severalfold during complex JOIN operations.
• If a query starts processing the data and the resultant dataset cannot fit in the
available memory, the query will fail.
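
The join and memory points above can be illustrated with a small Python sketch. It is a toy model only, not Impala's code: the cost formula in choose_join_strategy, the row-count budget max_rows_in_memory, and the QueryMemoryExceeded exception are all assumptions made up for this illustration. It shows the general idea of picking between a broadcast and a partitioned join by estimated network transfer, and of an in-memory hash join that fails as soon as the build side or the result no longer fits.

```python
from collections import defaultdict


class QueryMemoryExceeded(Exception):
    """Raised when the toy join would exceed its memory budget (hypothetical)."""


def choose_join_strategy(left_bytes, right_bytes, num_nodes):
    """Pick the strategy with the lower estimated network transfer.

    Toy cost model (an assumption, not Impala's planner):
    - broadcast: the smaller table is copied to every node
    - partitioned: both tables are redistributed across the network once
    """
    broadcast_cost = min(left_bytes, right_bytes) * num_nodes
    partitioned_cost = left_bytes + right_bytes
    return "broadcast" if broadcast_cost <= partitioned_cost else "partitioned"


def hash_join(left_rows, right_rows, key, max_rows_in_memory):
    """In-memory hash join that fails once the row budget is exceeded."""
    # Build phase: hash the right-hand table on the join key.
    build = defaultdict(list)
    held = 0  # rows currently held in memory (build side plus result)
    for row in right_rows:
        held += 1
        if held > max_rows_in_memory:
            raise QueryMemoryExceeded("build side does not fit in memory")
        build[row[key]].append(row)

    # Probe phase: every matching pair adds a row to the in-memory result.
    result = []
    for row in left_rows:
        for match in build.get(row[key], []):
            held += 1
            if held > max_rows_in_memory:
                raise QueryMemoryExceeded("result set does not fit in memory")
            result.append({**match, **row})
    return result
```

In this model, a small table joined to a much larger one on a ten-node cluster favors a broadcast join, while two tables of similar size favor a partitioned join. A join whose output has more rows than the budget allows raises the exception instead of returning partial results, which mirrors the behavior described in the last bullet.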