Impala Administration and Performance Improvements - Learning Cloudera Impala

Testing query performance

Most user time in Impala is spent writing and executing queries. To judge whether your Impala cluster is performing optimally, you usually measure query execution time before and after fine-tuning the cluster or the query. The difference between the two measurements tells you whether the change produced an improvement. Let's learn how to measure query execution time precisely enough to make sound judgments.
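
As a small worked example of the before-and-after comparison, the following Python sketch computes the relative improvement from two sets of timings. The function name percent_improvement and the sample numbers are illustrative assumptions, not measurements from a real cluster; using the median of several runs is one reasonable way to reduce noise, not a method prescribed by Impala.

```python
from statistics import median


def percent_improvement(before_seconds, after_seconds):
    """Return how much faster the tuned runs are, as a percentage of the baseline.

    Both arguments are lists of wall-clock timings in seconds. The median of
    each list is compared so that a single slow run does not skew the result.
    """
    baseline = median(before_seconds)
    tuned = median(after_seconds)
    if baseline <= 0:
        raise ValueError("baseline timings must be positive")
    return (baseline - tuned) / baseline * 100.0


# Illustrative numbers only (hypothetical timings in seconds).
before = [42.0, 40.0, 41.0]
after = [30.0, 31.0, 29.5]
print(f"{percent_improvement(before, after):.1f}% faster")
```

A negative result would mean the change made the query slower.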

Benchmarking queries

When processing terabytes of data from multiple nodes, a query can run for a long time. If you print the query output to a console, the time needed to render that output is counted as part of the query execution. To get closer to the actual execution time, run the query from impala-shell with the -B option, which turns off pretty-printing of the results. You can also use the -o option to save the query results in a file instead of displaying them on screen.
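
The options above can be combined in a small timing helper. This is a minimal sketch, assuming impala-shell is on the PATH of the machine running the script; the helper names, the table name sample_table, and the output path are hypothetical. The measured time is wall-clock time for the whole impala-shell process, so it also includes shell startup and connection time.

```python
import subprocess
import time
from typing import List, Optional


def build_impala_shell_command(query: str,
                               output_file: Optional[str] = None,
                               impalad: Optional[str] = None) -> List[str]:
    """Build an impala-shell invocation with pretty-printing turned off."""
    # -B prints delimited output instead of pretty-printed tables,
    # -q passes the query text on the command line.
    command = ["impala-shell", "-B", "-q", query]
    if output_file:
        # -o writes the query results to a file instead of the console.
        command += ["-o", output_file]
    if impalad:
        # -i selects the impalad host to connect to.
        command += ["-i", impalad]
    return command


def time_command(command: List[str]) -> float:
    """Run a command, discard its standard output, and return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


# Hypothetical query and output path, shown only to illustrate the command line.
command = build_impala_shell_command(
    "SELECT COUNT(*) FROM sample_table",
    output_file="/tmp/query_result.txt",
)
print(" ".join(command))
```

On a host where impala-shell is installed, calling time_command(command) a few times before and after a tuning change gives the two sets of timings to compare.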

Verifying data locality

We have repeatedly seen that, to achieve maximum performance with Impala, a query must be distributed across every node in the cluster. You can design a query to be executed on all the nodes in the cluster; however, how can you check whether the query actually ran on all of them? This section answers that question.

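To find out whether a query was executed on the nodes that hold its data, you will have to dig into the Impala logs. Make sure Impala logging is enabled and, after executing the query, open the logs either in an editor or using Cloudera Manager or Navigator. Look for the following line, which reports how much data was scanned from remote nodes. A value of 0 means that no data was read remotely, which is the result you want; a nonzero value means that some of the data was read from other nodes, for example because impalad is not running on every node that stores the data: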

Total remote scan volume = 0

You can search the log files for the presence of remote scans and, based on their occurrence, troubleshoot data locality on your Impala cluster. More informa-
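
The log search described in this section can be sketched in Python as follows. The only log text taken from the source is the phrase "Total remote scan volume = "; the function names, the sample lines, and the way nonzero values are formatted (a number followed by a unit) are assumptions made for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Matches the log phrase quoted in the text and captures whatever follows it.
REMOTE_SCAN = re.compile(r"Total remote scan volume = (.+?)\s*$")


def remote_scan_entries(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, value) for every remote scan volume entry found."""
    entries = []
    for number, line in enumerate(lines, start=1):
        match = REMOTE_SCAN.search(line)
        if match:
            entries.append((number, match.group(1)))
    return entries


def is_nonzero(value: str) -> bool:
    """Treat a value as nonzero unless its leading number parses to zero."""
    match = re.match(r"\d+(?:\.\d+)?", value)
    # Values that cannot be parsed are reported as nonzero so they get inspected.
    return match is None or float(match.group()) != 0.0


# Hypothetical log lines; real impalad log lines carry additional prefixes.
sample_log = [
    "query started",
    "Total remote scan volume = 0",
    "Total remote scan volume = 1.50 GB",
]
for number, value in remote_scan_entries(sample_log):
    label = "remote data read" if is_nonzero(value) else "no remote data read"
    print(f"line {number}: {value} ({label})")
```

To scan a real log, pass an open file object to remote_scan_entries, since iterating over a file yields its lines.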
