Database Reference
In-Depth Information
From these rough calculations, it should be apparent that you need to scale
out the query processing so that it can be done on multiple machines in
parallel, especially if you don't want to invest in absurdly expensive custom
hardware. In general, Google prefers to buy a lot of off-the-shelf hardware
and figure out ways to make algorithms scale out. Any query processing
solution, therefore, should run on commodity hardware.
This section describes the Dremel query engine, which is Google's solution
for scaling out SQL queries by processing them in parallel.
Dremel Serving Trees
If you were going to construct a parallel SQL engine, how would you do it?
You'd likely want to have a number of independent workers, each operating
over a small subset of the data. For some simple queries, this might be all
you need. Take, for example, the query SELECT field1 FROM table1
WHERE field2 > 0 . This query is trivially parallelizable. Each worker,
operating over a portion of the data, can read the two required fields, and
return the field1 values whenever field2 matches the filter condition.
You may suspect that not all queries will be this easy, and if so, you are
right. Now take a similar query: SELECT COUNT( field1 ) FROM table1
WHERE field2 > 0 . This seems like it should be easy too, but instead of
just returning the field1 values, you need to count values. Because you
need to return a single total count, you can't just have each parallel worker
operate independently. However, you can have another worker aggregate
the results and compute the total sum. Let's call the aggregator the “mixer,”
since it mixes data from multiple workers, and the workers that are reading
the data directly “shards.”
After partitioning the query workers into a mixer and shards, to compute the
query results you can send the filter portion of the query ( SELECT field1
WHERE field2 > 0 ) to each of the shards and have them send the filtered
results to the mixer. The mixer would then count the results and report the
final sum.
Alternatively, you could pre-aggregate the count in the shards. That is,
compute the local count in the shards, and the mixer would just have to sum
up the partial counts. This way of running the query means that a lot more
of the work can be done in the shards, since the mixer only has to compute
Search WWH ::




Custom Search