Database Reference
In-Depth Information
Fine-tuning Impala performance
In this section, let's review a few key factors that affect Impala performance.
Partitioning
In this method, data is physically divided from frequently queried fields or columns
into different values. This way, when a query is executed, it processes only a specific
partition or a portion of the data, achieving significant faster results than the full data-
set. In general, data files specific to a single table reside in a single directory. Using
partitioning, you can distribute the data in a way such that a fraction of data should be
read, depending on the query and its data-limiting clause. Once a partition is applied
to a table, the data is physically loaded on a different location on the disk based on
the query parameter, which provides faster access to the data when queried by the
SELECT
statement and the partition name.
You can write
TABLE
specific Impala SQL statements as follows to take advantage of
the
PARTITIONED BY
method:
• Add a partition when creating a table by adding the
PARTITIONED BY
clause
as follows:
CREATE TABLE […] PARTITIONED BY
• You can modify an existing table to support partitions by using the
ALTER
TABLE
statement followed by
PARTITONED BY
as follows:
ALTER TABLE […] PARTITIONED BY
Join queries
It is a well-known fact that a query operating on multiple tables using the
JOIN
opera-
tion will take a long time to finish if it is not written correctly. Here are some techniques
you can use in this regard:
• A well-known technique for fine-tuning an Impala
join
query is to specify the
tables in an optimal order by first having the table that has the maximum num-