Testing query performance
Most user time in Impala is spent writing and executing queries. To understand whether
your Impala cluster is performing optimally, you usually measure query execution time
before and after fine-tuning the Impala cluster or your query. The difference between
the two measurements shows whether the change actually improved performance. Let's
learn how to measure query execution time precisely so that you can judge the results
properly.
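The before-and-after comparison can be expressed as a small calculation. The following is a minimal sketch, not a prescribed method: the function name improvement and the choice of the median as the summary statistic are assumptions made for illustration.

```python
from statistics import median

def improvement(before, after):
    """Return the fractional reduction in median execution time.

    before, after: lists of execution times in seconds, measured
    before and after tuning. A positive result means the tuned run
    is faster; a negative result means it got slower.
    """
    base = median(before)
    if base <= 0:
        raise ValueError("baseline timings must be positive")
    return (base - median(after)) / base
```

For example, illustrative timings of 10, 12, and 11 seconds before tuning and 5.5, 6, and 5 seconds after tuning give a result of 0.5, that is, a 50 percent reduction in median execution time.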
Benchmarking queries
When a query processes terabytes of data across multiple nodes, it runs for a long time.
If you print the query output to the console, the time spent rendering that output is
counted as part of the query execution time. To get a measurement closer to the actual
execution time, run the query in impala-shell with the -B option, which prints results
in plain delimited format and avoids the overhead of pretty-printing. The other option
is to save the query results to a file using the -o option.
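A timing run along these lines could be scripted as follows. This is a rough sketch under stated assumptions: it assumes impala-shell is on the PATH and can connect to the cluster with its default settings, the helper names build_command and time_command are hypothetical, and the wall-clock time it records also includes impala-shell startup and connection time.

```python
import subprocess
import time

def build_command(query, output_file=None):
    """Build an impala-shell command line for a timing run."""
    # -B prints plain delimited output (no pretty-printing); -q passes the query.
    cmd = ["impala-shell", "-B", "-q", query]
    if output_file is not None:
        # -o writes the query results to a file instead of the console.
        cmd += ["-o", output_file]
    return cmd

def time_command(cmd, runs=3):
    """Run cmd several times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard stdout so console rendering does not add to the measurement.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations
```

Running the same command several times and comparing the durations helps separate the effect of a tuning change from ordinary run-to-run variation.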
Verifying data locality
We have repeatedly seen that to achieve maximum performance with Impala, the
query must be distributed across every node in the cluster. You can design a query to
be executed on all the nodes in the cluster; however, how can you check whether the
query actually ran on all of them? This section answers that question.
To check data locality for a query, you will have to dig into the Impala logs. Make
sure you have Impala logging enabled and, after executing the query, open the logs
either in an editor or using Cloudera Manager or Navigator. In the logs, search for
the remote scan volume reported for the query. A line like the following means that
no data was scanned from remote nodes, so every scan was served by an Impala daemon
running next to its data:
Total remote scan volume = 0
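The log search can also be scripted. The following is a minimal sketch that collects the value reported on each remote scan volume line and separates the zero values from the rest. The function names are hypothetical, and only the zero-valued line quoted above comes from the text; how nonzero values are formatted (units, decimals) is an assumption, so the value is kept as raw text.

```python
import re

# Pattern taken from the log line quoted above. Everything after "=" is
# captured as raw text because the format of nonzero values is assumed.
REMOTE_SCAN = re.compile(r"Total remote scan volume = (.+)")

# Matches "0", "0.00", "0B", "0 B" and similar zero-valued forms (assumed).
ZERO_VALUE = re.compile(r"0+(\.0+)?(\s*[A-Za-z]+)?$")

def remote_scan_values(lines):
    """Return the value reported on each remote scan volume line."""
    values = []
    for line in lines:
        match = REMOTE_SCAN.search(line)
        if match:
            values.append(match.group(1).strip())
    return values

def is_zero(value):
    """Return True if the reported volume is zero."""
    return ZERO_VALUE.match(value) is not None

def nonzero_remote_scans(path):
    """Read a log file and return the nonzero remote scan volumes it reports."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return [v for v in remote_scan_values(log) if not is_zero(v)]
```

Passing the path of an Impala daemon log to nonzero_remote_scans returns an empty list when every reported remote scan volume is zero.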
You can search the log files for remote scan entries and, based on what they report,
troubleshoot data locality on your Impala cluster. More informa-