processors for a single query. That said, even if they did parallelize single
query execution, the database would still be limited by disk I/O speeds—if
your data is stored on a single disk, reading the disk from multiple places in
parallel may actually be slower than reading it sequentially.
The SQL query language is highly parallelizable, however, as long as you
have a way to take advantage of it. The Dremel query engine created a
way to parallelize SQL execution across thousands of machines. Chapter
9, “Understanding Query Execution,” describes in detail how it works, but
the central principle is that it is a scale-out solution. If you want your
queries to run faster, you can throw more machines at the problem. This is
a contrast to a traditional scale-up architecture, where you buy fancier
hardware when you want more performance.
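To see why scale-out helps, note that a full-table scan parallelizes almost
perfectly: doubling the number of workers roughly halves the wall-clock time.
The following sketch is purely illustrative and assumes a hypothetical
per-worker throughput of 100 MB per second; it does not model BigQuery's
actual scheduler.

# Illustrative sketch only: an idealized scale-out scan, not BigQuery's
# actual scheduler. Assumes the work splits evenly and every worker
# sustains the same (assumed) throughput.

def scan_seconds(table_bytes, workers, bytes_per_sec_per_worker=100e6):
    """Wall-clock time to scan a table split evenly across workers."""
    return table_bytes / (workers * bytes_per_sec_per_worker)

one_tb = 1e12
for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} workers: {scan_seconds(one_tb, n):10.2f} s")

Under these assumptions, 10 workers take about 1,000 seconds to scan a
terabyte, while 10,000 workers take about 1 second; adding machines, rather
than buying faster ones, is what shortens the query.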
When run in the Google infrastructure, the Dremel architecture scales
nearly linearly to tens of thousands of processor cores and hundreds of
thousands of disks. The performance goal of the system was to process a
terabyte of data in a second; although peak performance numbers have not
been published, those goals have been met and exceeded.
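A quick back-of-envelope calculation, using assumed round numbers since the
actual figures are not published, shows why the single-disk I/O bottleneck
mentioned earlier disappears at this scale: spread over hundreds of thousands
of disks, a 1 TB-per-second aggregate scan asks very little of any one disk.

# Back-of-envelope only: both numbers below are assumptions, not
# published figures for the Dremel clusters.

aggregate_bytes_per_sec = 1e12   # the 1 TB/s performance goal
disks = 100_000                  # "hundreds of thousands of disks"

per_disk_mb_per_sec = aggregate_bytes_per_sec / disks / 1e6
print(f"Required per-disk throughput: {per_disk_mb_per_sec:.0f} MB/s")
# ~10 MB/s, a small fraction of a single disk's sequential bandwidth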
Of course, this doesn't mean that you'll automatically see performance in
that range; the Dremel clusters used by BigQuery are tuned for serving
multiple queries at once rather than single queries at peak speed. A rough
estimate for performance you can expect is on the order of 50 GB per
second for a simple query. More complex queries (JOINs, complex regular
expressions, and so on) will be somewhat slower. That said, 95 percent of
all queries in the public BigQuery clusters finish in less than 5 seconds.
However, unless you reserve capacity, you may find that performance
fluctuates significantly due to load on the system.
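As a rough rule of thumb built on the 50 GB-per-second figure above, you can
estimate how long a simple scan should take from the amount of data the query
touches. The helper below is a hypothetical sketch, not a BigQuery API, and
ignores the load effects just described.

# Hypothetical estimator based on the rough 50 GB/s figure quoted above;
# real times vary with query complexity, cluster load, and whether you
# have reserved capacity.

SIMPLE_QUERY_BYTES_PER_SEC = 50e9  # ~50 GB scanned per second (rough)

def estimated_seconds(bytes_processed,
                      throughput=SIMPLE_QUERY_BYTES_PER_SEC):
    """Rough estimate of query duration from bytes scanned."""
    return bytes_processed / throughput

print(f"~{estimated_seconds(500e9):.0f} s for a 500 GB scan")  # ~10 s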