Impala Administration and Performance Improvements - Learning Cloudera Impala

Database Reference

In-Depth Information

ber of records or rows, which are part of the result set. After that, select the

next largest table in terms of the number of rows in the results set, and finally

the smallest table.

• Another method to optimize a faster JOIN operation is to use the HINT

clause with JOIN to select a specific Impala query planner. When a SELECT

statement based on JOIN is given to Impala to execute, the Impala query

planner works on it to find the best strategy. The Impala query planner first

checks metadata and the number of records in the result set for each table in

the SELECT statement and then chooses an appropriate JOIN strategy. You

can get this information by using the EXPLAIN clause with your query. If you

think that changing the JOIN strategy will be helpful, you can use the HINT

clause as follows to apply the specific JOIN type. You can learn more about

HINT in any detailed SQL documentation:

SELECT table1.field1, table2.field1 FROM

table1 JOIN

[BROADCAST | SHUFFLE] table2 ON

[condition….]

The SQL JOIN operation itself is very large and requires a great deal of understand-

ing to optimize. While the preceding information is good for reference purposes,

I would suggest reading more details on it in an external reference document to

achieve optimum performance with JOIN queries.

Table and column statistics

In the previous section, we talked about the query planner. Now, we will learn how

the query planner decides which strategy is best for it. The query planner uses indi-

vidual column statistics by getting metadata from the metastore if it is available. All

the columns, which are part of the result set in the JOIN query, are calculated for all

the records, which helps the query planner to make its decision.

The Impala query planner also uses statistics for all the tables and partitions from the

metastore and, based on this information, it makes a decision. These table, column,

and partition statistics can be gathered using the ANALYZE TABLE statement by

passing the table name. I would like to inform you that the Impala query planner

does not create this information; instead it depends on Hive for this and the ANALYZE

Search WWH ::

Custom Search

Home