disks, you can read the data much faster than you could read from a single
disk by reading from multiple different locations at once.
Not all your data is stored in Colossus, however. Data that is streamed into
BigQuery is temporarily stored in Bigtable. Small tables may be stored inline
in Megastore along with the metadata. And, of course, a number of other
storage systems at Google may be in use now or in the future to store your
table data. Although this may sound cryptic or vague, the bottom line is that
you shouldn't make any assumptions about where or how your data will
be stored. You can, however, assume that Google will continue to invest in
storage systems that improve reliability, durability, and performance.
As important as the “where” of data storage is the “how.” BigQuery uses a
proprietary columnar storage format called ColumnIO. ColumnIO is tuned
to the usage patterns for BigQuery, and allows you to read just the columns
that are needed to execute a query. This not only improves performance but
is also what allows BigQuery to charge only for the columns that a query
references.
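To make the column-pruning idea concrete, here is a minimal Python sketch. It is not the ColumnIO format and not BigQuery's billing formula; the table contents and the `bytes_scanned` helper are hypothetical, chosen only to show that a columnar layout lets a reader touch just the columns a query references.

```python
from typing import Dict, Iterable, List

# Hypothetical toy table, stored one list per column (column-oriented).
# Values are kept as strings so that sizes are easy to reason about.
COLUMNS: Dict[str, List[str]] = {
    "word": ["the", "king", "lear"],
    "corpus": ["hamlet", "hamlet", "kinglear"],
    "word_count": ["25", "3", "1"],
}


def bytes_scanned(columns: Dict[str, List[str]], referenced: Iterable[str]) -> int:
    """Count the bytes read when only the referenced columns are touched."""
    return sum(
        len(value.encode("utf-8"))
        for name in referenced
        for value in columns[name]
    )


# A query that references only "word" never reads the other two columns.
one_column = bytes_scanned(COLUMNS, ["word"])
all_columns = bytes_scanned(COLUMNS, COLUMNS.keys())
print(one_column, all_columns)
```

In a row-oriented layout the reader would have to pass over every field of every row, so the cost would be closer to the all-columns figure regardless of which fields the query needs.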
Networking
As more people move to scale-out architectures for Big Data, they realize
that network connections between machines become a big bottleneck. This
mostly follows from common sense: when moving from a single machine
to multiple machines, the effective bandwidth you have available to get to
your data ends up going down by a couple of orders of magnitude. Even
in a Non-Uniform Memory Access (NUMA) machine, memory in another
node is much cheaper to access than data that resides on another machine
in the network. If you invest more heavily in the network components that
carry data from one machine to another, you can more closely replicate the
single-machine performance in a clustered network environment.
In a large network cluster, however, it is harder to ensure that you have a fast
network path between all combinations of machines. Many Big Data suites,
such as Hadoop, allow you to tune the way they run to take into account
network topology and physical distance between machines. If two machines
share the same physical rack, for instance, the bandwidth between them is
likely to be much higher than if they are on opposite sides of the datacenter.
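Topology-aware scheduling of this kind can be sketched as a distance function over location paths. The example below is an illustration in the spirit of rack awareness, not Hadoop's actual implementation; the `/datacenter/rack/node` path scheme, the node names, and the `topology_distance` and `closest_replica` helpers are all assumptions made for the sketch.

```python
from typing import List


def topology_distance(a: str, b: str) -> int:
    """Count the hops from each node up to their closest common ancestor.

    Locations are hypothetical paths such as "/dc1/rack1/node1".
    """
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for part_a, part_b in zip(path_a, path_b):
        if part_a != part_b:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


def closest_replica(reader: str, replicas: List[str]) -> str:
    """Pick the replica with the smallest topology distance to the reader."""
    return min(replicas, key=lambda replica: topology_distance(reader, replica))


reader = "/dc1/rack1/node1"
replicas = ["/dc1/rack2/node7", "/dc1/rack1/node4", "/dc2/rack5/node2"]
print(closest_replica(reader, replicas))
```

A smaller distance stands in for a faster network path: a replica on the same rack is preferred over one on another rack, which is in turn preferred over one in another datacenter.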
Google guards the details of its datacenter hardware extremely closely. That
said, from public benchmarks that people have run on Google Compute
Engine, it is clear that one of the main distinguishing factors in the Google