Impala Administration and Performance Improvements - Learning Cloudera Impala

Database Reference

In-Depth Information

Fine-tuning Impala performance

In this section, let's review a few key factors that affect Impala performance.

Partitioning

In this method, data is physically divided from frequently queried fields or columns

into different values. This way, when a query is executed, it processes only a specific

partition or a portion of the data, achieving significant faster results than the full data-

set. In general, data files specific to a single table reside in a single directory. Using

partitioning, you can distribute the data in a way such that a fraction of data should be

read, depending on the query and its data-limiting clause. Once a partition is applied

to a table, the data is physically loaded on a different location on the disk based on

the query parameter, which provides faster access to the data when queried by the

SELECT statement and the partition name.

You can write TABLE specific Impala SQL statements as follows to take advantage of

the PARTITIONED BY method:

• Add a partition when creating a table by adding the PARTITIONED BY clause

as follows:

CREATE TABLE […] PARTITIONED BY

• You can modify an existing table to support partitions by using the ALTER

TABLE statement followed by PARTITONED BY as follows:

ALTER TABLE […] PARTITIONED BY

Join queries

It is a well-known fact that a query operating on multiple tables using the JOIN opera-

tion will take a long time to finish if it is not written correctly. Here are some techniques

you can use in this regard:

• A well-known technique for fine-tuning an Impala join query is to specify the

tables in an optimal order by first having the table that has the maximum num-

Search WWH ::

Custom Search

Home