Understanding Query Execution - Google BigQuery Analytics

Database Reference

In-Depth Information

From these rough calculations, it should be apparent that you need to scale

out the query processing so that it can be done on multiple machines in

parallel, especially if you don't want to invest in absurdly expensive custom

hardware. In general, Google prefers to buy a lot of off-the-shelf hardware

and figure out ways to make algorithms scale out. Any query processing

solution, therefore, should run on commodity hardware.

This section describes the Dremel query engine, which is Google's solution

for scaling out SQL queries by processing them in parallel.

Dremel Serving Trees

If you were going to construct a parallel SQL engine, how would you do it?

You'd likely want to have a number of independent workers, each operating

over a small subset of the data. For some simple queries, this might be all

you need. Take, for example, the query SELECT field1 FROM table1

WHERE field2 > 0 . This query is trivially parallelizable. Each worker,

operating over a portion of the data, can read the two required fields, and

return the field1 values whenever field2 matches the filter condition.

You may suspect that not all queries will be this easy, and if so, you are

right. Now take a similar query: SELECT COUNT( field1 ) FROM table1

WHERE field2 > 0 . This seems like it should be easy too, but instead of

just returning the field1 values, you need to count values. Because you

need to return a single total count, you can't just have each parallel worker

operate independently. However, you can have another worker aggregate

the results and compute the total sum. Let's call the aggregator the “mixer,”

since it mixes data from multiple workers, and the workers that are reading

the data directly “shards.”

After partitioning the query workers into a mixer and shards, to compute the

query results you can send the filter portion of the query ( SELECT field1

WHERE field2 > 0 ) to each of the shards and have them send the filtered

results to the mixer. The mixer would then count the results and report the

final sum.

Alternatively, you could pre-aggregate the count in the shards. That is,

compute the local count in the shards, and the mixer would just have to sum

up the partial counts. This way of running the query means that a lot more

of the work can be done in the shards, since the mixer only has to compute

Search WWH ::

Custom Search

Home