Database Reference
In-Depth Information
ber of records or rows, which are part of the result set. After that, select the
next largest table in terms of the number of rows in the results set, and finally
the smallest table.
• Another method to optimize a faster
JOIN
operation is to use the
HINT
clause with
JOIN
to select a specific Impala query planner. When a
SELECT
statement based on
JOIN
is given to Impala to execute, the Impala query
planner works on it to find the best strategy. The Impala query planner first
checks metadata and the number of records in the result set for each table in
the
SELECT
statement and then chooses an appropriate
JOIN
strategy. You
can get this information by using the
EXPLAIN
clause with your query. If you
think that changing the
JOIN
strategy will be helpful, you can use the
HINT
clause as follows to apply the specific
JOIN
type. You can learn more about
HINT
in any detailed SQL documentation:
SELECT table1.field1, table2.field1 FROM
table1 JOIN
[BROADCAST | SHUFFLE] table2 ON
[condition….]
The SQL
JOIN
operation itself is very large and requires a great deal of understand-
ing to optimize. While the preceding information is good for reference purposes,
I would suggest reading more details on it in an external reference document to
achieve optimum performance with
JOIN
queries.
Table and column statistics
In the previous section, we talked about the query planner. Now, we will learn how
the query planner decides which strategy is best for it. The query planner uses indi-
vidual column statistics by getting metadata from the metastore if it is available. All
the columns, which are part of the result set in the
JOIN
query, are calculated for all
the records, which helps the query planner to make its decision.
The Impala query planner also uses statistics for all the tables and partitions from the
metastore and, based on this information, it makes a decision. These table, column,
and partition statistics can be gathered using the
ANALYZE TABLE
statement by
passing the table name. I would like to inform you that the Impala query planner
does not create this information; instead it depends on Hive for this and the
ANALYZE