BigQuery Fundamentals - Google BigQuery Analytics

disks, you can read the data much faster than you could read from a single

disk by reading from multiple different locations at once.

Not all your data is stored in Colossus, however. Data that is streamed into

BigQuery is temporarily stored in Bigtable. Small tables may be stored inline

in Megastore along with the metadata. And, of course, a number of other

storage systems at Google may be in use now or in the future to store your

table data. Although this may sound cryptic or vague, the bottom line is that

you shouldn't make any assumptions about where or how your data will

be stored. You can, however, assume that Google will continue to invest in

storage systems that improve reliability, durability, and performance.

As important as the “where” of data storage is the “how.” BigQuery uses a

proprietary columnar storage format called ColumnIO. ColumnIO is tuned

to the usage patterns for BigQuery, and allows you to read just the columns

that are needed to execute a query. This not only improves performance; it is

also what allows BigQuery to charge only for the columns that a query

references.
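
The effect of column-oriented storage on the amount of data read can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ColumnIO itself: the table, its column names, the `column_bytes` and `bytes_scanned` helpers, and the size estimate (8 bytes per number, UTF-8 length per string) are all hypothetical.

```python
from typing import Dict, List

# Hypothetical table, stored one list per column rather than one record per row.
table: Dict[str, List[object]] = {
    "word": ["the", "of", "and"],
    "count": [100, 80, 60],
    "corpus": ["hamlet", "hamlet", "macbeth"],
}


def column_bytes(values: List[object]) -> int:
    """Rough size of one column: UTF-8 length for strings, 8 bytes otherwise."""
    total = 0
    for value in values:
        if isinstance(value, str):
            total += len(value.encode("utf-8"))
        else:
            total += 8
    return total


def bytes_scanned(tbl: Dict[str, List[object]], referenced: List[str]) -> int:
    """Only the columns a query references have to be read at all."""
    return sum(column_bytes(tbl[name]) for name in referenced)


# A query that touches only `word` and `count` never reads `corpus`.
partial = bytes_scanned(table, ["word", "count"])
full = bytes_scanned(table, list(table))
print(partial, full)
```

In a row-oriented layout the reader would have to pass over every field of every row, so the cost would be closer to `full` even when the query needs only two columns.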

Networking

As more people move to scale-out architectures for Big Data, they realize

that network connections between machines become a big bottleneck. This

mostly follows from common sense—when moving from a single machine

to multiple machines, the effective bandwidth you have available to get to

your data ends up going down by a couple of orders of magnitude. Even

in a Non-Uniform Memory Access (NUMA) machine, memory in another

node is much cheaper to access than data that resides on another machine

in the network. If you invest more heavily in the network components that

carry data from one machine to another, you can more closely replicate the

single-machine performance in a clustered network environment.

In a large network cluster, however, it is harder to ensure that you have a fast

network path between all combinations of machines. Many Big Data suites,

such as Hadoop, allow you to tune the way they run to take into account

network topology and physical distance between machines. If two machines

share the same physical rack, for instance, the bandwidth between them is

likely to be much higher than if they are on opposite sides of the datacenter.
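
One way to picture topology-aware tuning is a distance function over machine locations. The sketch below is loosely modeled on the tree-distance idea used by rack-aware systems and is not the API of Hadoop or any other product; the `/datacenter/rack/host` path format and the `distance` and `closest_replica` helpers are assumptions made for illustration.

```python
from typing import List


def distance(a: str, b: str) -> int:
    """Number of tree hops between two locations such as '/dc1/rack1/host1'."""
    parts_a = a.strip("/").split("/")
    parts_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        common += 1
    # Hops up from each location to the closest shared ancestor.
    return (len(parts_a) - common) + (len(parts_b) - common)


def closest_replica(reader: str, replicas: List[str]) -> str:
    """Prefer the copy of the data that is nearest in the topology tree."""
    return min(replicas, key=lambda location: distance(reader, location))


reader = "/dc1/rack1/host1"
replicas = ["/dc1/rack2/host7", "/dc1/rack1/host3", "/dc2/rack5/host9"]
print(closest_replica(reader, replicas))
```

Under this model two hosts in the same rack are 2 hops apart, hosts in different racks of the same datacenter are 4 hops apart, and hosts in different datacenters are 6 hops apart, so a scheduler would read from the same-rack replica first.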

Google guards the details of its datacenter hardware extremely closely. That

said, from public benchmarks that people have run on Google Compute

Engine, it is clear that one of the main distinguishing factors in the Google
