processors for a single query. That said, even if they did parallelize single
query execution, the database would still be limited by disk I/O speeds—if
your data is stored on a single disk, reading the disk from multiple places in
parallel may actually be slower than reading it sequentially.
The SQL query language is highly parallelizable, however, as long as you
have a way to take advantage of it. The Dremel query engine created a
way to parallelize SQL execution across thousands of machines. Chapter
9, “Understanding Query Execution,” describes in detail how it works, but
the central principle is that it is a scale-out solution. If you want your
queries to run faster, you can throw more machines at the problem. This is
a contrast to a traditional scale-up architecture, where you buy fancier
hardware when you want more performance.
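To see why scale-out helps, note that a full-table scan parallelizes almost
perfectly: doubling the number of workers roughly halves the wall-clock time.
The following sketch is purely illustrative and assumes a hypothetical
per-worker throughput of 100 MB per second; it does not model BigQuery's
actual scheduler.

# Illustrative sketch only: an idealized scale-out scan, not BigQuery's
# actual scheduler. Assumes the work splits evenly and every worker
# sustains the same (assumed) throughput.

def scan_seconds(table_bytes, workers, bytes_per_sec_per_worker=100e6):
    """Wall-clock time to scan a table split evenly across workers."""
    return table_bytes / (workers * bytes_per_sec_per_worker)

one_tb = 1e12
for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} workers: {scan_seconds(one_tb, n):10.2f} s")

Under these assumptions, 10 workers take about 1,000 seconds to scan a
terabyte, while 10,000 workers take about 1 second; adding machines, rather
than buying faster ones, is what shortens the query.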
When run in the Google infrastructure, the Dremel architecture scales
nearly linearly to tens of thousands of processor cores and hundreds of
thousands of disks. The performance goal of the system was to process a
terabyte of data in a second; although peak performance numbers have not
been published, those goals have been met and exceeded.
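A quick back-of-envelope calculation, using assumed round numbers since the
actual figures are not published, shows why the single-disk I/O bottleneck
mentioned earlier disappears at this scale: spread over hundreds of thousands
of disks, a 1 TB-per-second aggregate scan asks very little of any one disk.

# Back-of-envelope only: both numbers below are assumptions, not
# published figures for the Dremel clusters.

aggregate_bytes_per_sec = 1e12   # the 1 TB/s performance goal
disks = 100_000                  # "hundreds of thousands of disks"

per_disk_mb_per_sec = aggregate_bytes_per_sec / disks / 1e6
print(f"Required per-disk throughput: {per_disk_mb_per_sec:.0f} MB/s")
# ~10 MB/s, a small fraction of a single disk's sequential bandwidth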
Of course, this doesn't mean that you'll automatically see performance in
that range; the Dremel clusters used by BigQuery are tuned for serving
multiple queries at once rather than single queries at peak speed. A rough
estimate for performance you can expect is on the order of 50 GB per
second for a simple query. More complex queries (JOINs, complex regular
expressions, and so on) will be somewhat slower. That said, 95 percent of
all queries in the public BigQuery clusters finish in less than 5 seconds.
However, unless you reserve capacity, you may find that performance
fluctuates significantly due to load on the system.
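As a rough rule of thumb built on the 50 GB-per-second figure above, you can
estimate how long a simple scan should take from the amount of data the query
touches. The helper below is a hypothetical sketch, not a BigQuery API, and
ignores the load effects just described.

# Hypothetical estimator based on the rough 50 GB/s figure quoted above;
# real times vary with query complexity, cluster load, and whether you
# have reserved capacity.

SIMPLE_QUERY_BYTES_PER_SEC = 50e9  # ~50 GB scanned per second (rough)

def estimated_seconds(bytes_processed,
                      throughput=SIMPLE_QUERY_BYTES_PER_SEC):
    """Rough estimate of query duration from bytes scanned."""
    return bytes_processed / throughput

print(f"~{estimated_seconds(500e9):.0f} s for a 500 GB scan")  # ~10 s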