Earlier in the chapter you saw how to deal with queries that have the first property by using the EACH qualifier in the GROUP BY clause. Queries that require a large amount of memory for each group they handle can fail even when the EACH qualifier is used. In some cases it is possible to avoid this resource limit; for example, the “Exact Count Distinct” section in this chapter presents a scalable alternative to the COUNT(DISTINCT field) function, which often causes resource exhaustion errors. Unfortunately, there is no reliable procedure for fixing every query that runs into this limit. Here we describe the class of queries that commonly encounter it and suggest ways to modify them so that they succeed.
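For reference, here is a minimal sketch of both ideas in BigQuery SQL, written against one of the public sample tables. It counts distinct titles per language by aggregating in two stages with the EACH qualifier instead of calling COUNT(DISTINCT title); this is one common scalable pattern, not necessarily the exact recipe given in the “Exact Count Distinct” section.

    SELECT language, COUNT(*) AS distinct_titles
    FROM (
      -- One row per (language, title) pair, spread across workers by EACH.
      SELECT language, title
      FROM [publicdata:samples.wikipedia]
      GROUP EACH BY language, title
    )
    GROUP EACH BY language

The inner query shards its per-group state across workers, so the outer query only needs to count rows per language rather than track every distinct title in one place.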
Most of the familiar aggregation functions have outputs whose size is independent of the number of values fed to them, for example COUNT, SUM, MIN, and MAX. However, the GROUP_CONCAT and NEST aggregation functions produce outputs that are proportional to the size of their inputs. Other aggregation functions have theoretically constant output size, but the constant can be large, as in the case of COUNT(DISTINCT field, N) with a large value for N (the constant is proportional to N). Finally, operations like QUANTILES keep intermediate state that grows only logarithmically with the number of input values, but once again the constant factor is large enough to be problematic. Because all of these consume a substantial amount of memory per output record, they tend to cause memory limit violations as the results are collected up the execution tree. To avoid the error, you need to modify the query to generate fewer total output rows by limiting the set of input values considered. This may require building up the final result by running multiple queries, each appending its results to the same final result table.
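As a hedged illustration of that approach, the sketch below assumes a hypothetical [mydataset.pageviews] table with user_id and page fields. Each run restricts the input to one hash bucket of users and appends its output to the same destination table (for example, by configuring the job with the WRITE_APPEND write disposition), so no single query has to hold all of the GROUP_CONCAT output at once.

    SELECT user_id, GROUP_CONCAT(page) AS pages
    FROM [mydataset.pageviews]
    -- Bucket 0 of 4; repeat the query with = 1, = 2, and = 3.
    WHERE ABS(HASH(user_id)) % 4 = 0
    GROUP EACH BY user_id

Four smaller runs like this produce the same rows as a single query over the whole table, while each run only materializes a quarter of the concatenated strings.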
When a query does not involve a join, the number of output rows is at most the number of input rows. When a join is present, then for each distinct value of the join key, the number of output rows is the product of the number of rows in the left and right tables that have that key. The query might eventually discard or aggregate these rows, but they must at least be transiently generated. Most of the time joins deal with one-to-many relationships, but when you work with a many-to-many relationship between two tables, the amount of intermediate state can easily grow large. This can lead to high memory usage in the leaf nodes of the execution engine. Note that even in cases where memory usage is not an issue, this expansion can cause the query to run slowly. In some cases you can control the expansion of data by adjusting the query, as sketched below. The Cross Join query recipe in
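Where the expansion comes from a many-to-many relationship, one possible adjustment (a sketch with hypothetical table and field names, not the recipe referenced above) is to aggregate one side down to a single row per join key before the join, so the relationship becomes many-to-one and the join no longer multiplies rows for each key value.

    SELECT b.id AS id, b.item AS item, a.events AS events_in_a
    FROM [mydataset.table_b] b
    JOIN EACH (
      -- Collapse table_a to one row per join key before joining.
      SELECT id, COUNT(*) AS events
      FROM [mydataset.table_a]
      GROUP EACH BY id
    ) a
    ON b.id = a.id

Whether this adjustment is possible depends on whether the rest of the query needs the individual rows from both sides; when it does, splitting the work across multiple appended queries, as described earlier, may still help.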