Impala Administration and Performance Improvements - Learning Cloudera Impala


4. Restart all DataNodes.

Adding more Impala nodes to achieve higher performance

Impala performance generally improves as more nodes are added to the cluster, in the same way that Hadoop performance improves when more DataNodes and TaskTrackers are added. With more nodes, the data is distributed across more machines and each query is spread across more of them, which ultimately yields higher performance.

Optimizing memory usage during query execution

You can improve query performance by restricting the amount of memory a query consumes during its execution. To do so, set the -mem_limit flag when starting the Impala daemon. This flag restricts only the memory consumed by queries; Impala still has memory available at startup to cache metadata and perform other startup actions.
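
As a rough illustration of what such a limit means on a given host, the following Python sketch converts a percentage limit into a byte budget and formats the matching daemon argument. It assumes the flag is spelled `-mem_limit` and accepts a percentage of physical memory; the function names and the 64 GiB host are hypothetical and chosen only for the example.

```python
GIB = 1024 ** 3


def query_memory_budget(physical_bytes: int, limit_percent: int) -> int:
    """Bytes available to queries when the limit is a percentage of physical memory."""
    if not 0 < limit_percent <= 100:
        raise ValueError("limit_percent must be between 1 and 100")
    return physical_bytes * limit_percent // 100


def impalad_mem_limit_arg(limit_percent: int) -> str:
    """Format the startup argument (assumed spelling: -mem_limit=<N>%)."""
    if not 0 < limit_percent <= 100:
        raise ValueError("limit_percent must be between 1 and 100")
    return f"-mem_limit={limit_percent}%"


if __name__ == "__main__":
    # Hypothetical host with 64 GiB of RAM and a 70% limit for queries.
    budget = query_memory_budget(64 * GIB, 70)
    print(impalad_mem_limit_arg(70), "leaves about", budget // GIB, "GiB for queries")
```

The remaining memory on the host stays outside the query limit, which matches the point above that metadata caching and other startup work are not restricted by the flag.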

Query execution dependency on memory

You might wonder how a memory limit affects query execution, since Impala depends heavily on available memory. If the data a query must hold in memory exceeds the memory available on a machine, the query will fail. Memory usage in Impala is not directly based on the size of the input dataset; instead, it varies with the type of query. An aggregation requires memory proportional to the number of rows after grouping, whereas a join requires memory equivalent to the combined size of all the tables except the largest one.
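
The two rules of thumb above can be turned into a back-of-the-envelope estimate. The Python sketch below is only an approximation under stated assumptions: the function names are hypothetical, `bytes_per_row` is an assumed average width of a grouped row, and real memory usage depends on details this sketch ignores.

```python
from typing import Sequence


def aggregation_memory(rows_after_grouping: int, bytes_per_row: int) -> int:
    """Estimate for an aggregation: memory grows with the rows left after grouping."""
    return rows_after_grouping * bytes_per_row


def join_memory(table_sizes: Sequence[int]) -> int:
    """Estimate for a join: combined size of all tables except the largest one."""
    if not table_sizes:
        return 0
    return sum(table_sizes) - max(table_sizes)


def fits_in_memory(required_bytes: int, available_bytes: int) -> bool:
    """True if the estimate stays within the memory available on one machine."""
    return required_bytes <= available_bytes


if __name__ == "__main__":
    gib = 1024 ** 3
    # Hypothetical join of a 200 GiB table with 3 GiB and 1 GiB tables.
    needed = join_memory([200 * gib, 3 * gib, 1 * gib])
    print(needed // gib, "GiB needed; fits in 8 GiB:", fits_in_memory(needed, 8 * gib))
```

In this example the 200 GiB table does not count toward the estimate, so the join needs roughly 4 GiB even though the input is far larger than the memory on any single machine.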

Using resource isolation

If you are using Cloudera Manager, you can implement resource isolation using the Linux cgroups mechanism, which is configured from within Cloudera Manager. For more information, see the Cloudera Manager documentation on resource isolation.
