likely data non-availability scenario would be if the BigQuery servers had a
global outage.
Of course, this section describes a snapshot in time; BigQuery may increase
or decrease the replication factor. Durability and availability of data are,
and will continue to be, top priorities of Google and BigQuery. For more
reading on the subject, check out this blog post on disaster recovery for
Google enterprise data:
http://googleenterprise.blogspot.com/2010/03/disaster-recovery-by-google.html
Although this article was not specifically written with BigQuery in mind,
the policies described are similar.
Query Processing
SQL is a declarative language rather than an imperative one. This is a
fancy way of saying that you declare what you want to happen, rather than
describe how you want it to happen. Without this property, the switch from
a standard, sequentially processed relational database to a parallel query
engine like Dremel would not be nearly as easy. Imagine if, instead of a
WHERE clause, you had to describe how to look up the data in a B-Tree (an
on-disk data structure that backs most relational databases). If this were
the case, you'd be stuck with databases that use B-Trees (and, most likely,
only programmers would be able to figure out how to run queries).
With SQL, however, the query describes precisely which data you want
returned and leaves it up to the database implementation to decide how to
get that data. You might have an Oracle database responding to your
queries, or you might have an overworked graduate student typing all the
responses by hand.
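
To make the declarative/imperative contrast concrete, here is a minimal
sketch using Python's built-in sqlite3 module. The people table and its rows
are hypothetical, invented only for illustration; nothing here is specific
to BigQuery or Dremel. The SQL version states which rows are wanted, while
the loop spells out the scan, the filter, and the sort by hand.

```python
import sqlite3

# Hypothetical table contents, invented purely for illustration.
rows = [("alice", 34), ("bob", 19), ("carol", 52), ("dave", 27)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

# Declarative: say which rows you want; the engine chooses how to find them.
declarative = conn.execute(
    "SELECT name FROM people WHERE age > 30 ORDER BY name"
).fetchall()

# Imperative: describe the scan, the filter, and the sort yourself.
imperative = []
for name, age in rows:
    if age > 30:
        imperative.append((name,))
imperative.sort()

conn.close()
```

Both variables end up holding the same rows, but only the imperative
version is tied to one particular way of walking the data. The SQL text
could be handed unchanged to an engine that uses a completely different
execution strategy.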
In the last section, you saw that between the Colossus distributed filesystem
and ColumnIO, you can easily meet the goal of reading the data for a 1
TB table in less than 1 second. Of course, just because you can get all of
that data off a disk quickly doesn't mean it is going to be fast to run the
query. For instance, you're not going to be able to process it all on a
single machine. Even with a magical 10-terabit Ethernet connection, you'd be
limited by memory bandwidth: the fastest is something like 25 GB per second,
which means you'd need 40 seconds just to read the data out of memory. And
even if you could spend only one processor cycle per row, it would take
almost a minute to process a terabyte table.
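
The back-of-the-envelope numbers above can be checked with a few lines of
Python. This is only a sketch: the table size, memory bandwidth, and network
speed come from the text, while the 1-second target and the idea that
bandwidth adds up linearly across machines are simplifying assumptions made
here for illustration.

```python
import math

# Figures taken from the text (decimal units).
TABLE_BYTES = 1e12             # a 1 TB table
MEMORY_BYTES_PER_SEC = 25e9    # "something like 25 GB per second"
NETWORK_BITS_PER_SEC = 10e12   # the "magical" 10-terabit Ethernet

# Time to pull the table over the network, then to stream it out of memory.
network_seconds = TABLE_BYTES * 8 / NETWORK_BITS_PER_SEC
memory_seconds = TABLE_BYTES / MEMORY_BYTES_PER_SEC

# Assumption (not from the text): memory bandwidth scales linearly with the
# number of machines, and the goal is to finish the read in one second.
TARGET_SECONDS = 1.0
min_machines = math.ceil(memory_seconds / TARGET_SECONDS)

print(f"network transfer: {network_seconds:.1f} s")
print(f"memory read:      {memory_seconds:.1f} s")
print(f"machines needed:  {min_machines}")
```

Under these assumptions the network is not the bottleneck. Memory bandwidth
alone pushes the work onto dozens of machines before any per-row processing
is counted.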