on standard server hard drives means that you can afford a lot more of
them—you just need to be prepared when some of them inevitably fail.
The machines that contain the disks and serve up the data are called chunk
servers. The term “chunk” refers to portions of the files; a large file will
be split into multiple chunks, and each will be stored on different physical
disks. This partitioning means that you can get higher effective read
bandwidth because you can read from many of these disks in parallel.
Dremel takes advantage of this; when you run a query, it can read your data
from thousands of disks at once.
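The chunking idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory "disks", the round-robin placement, the tiny chunk size, and the function names `split_into_chunks`, `place_chunks`, and `read_file_parallel` are all hypothetical and are not taken from Colossus or Dremel.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Cut a file's bytes into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(chunks, num_disks):
    """Assumed placement policy: chunk i is stored on disk i % num_disks."""
    disks = [dict() for _ in range(num_disks)]
    for index, chunk in enumerate(chunks):
        disks[index % num_disks][index] = chunk
    return disks


def read_file_parallel(disks, num_chunks):
    """Fetch every chunk using one worker per disk, then reassemble in order."""
    def read_chunk(index):
        return disks[index % len(disks)][index]

    with ThreadPoolExecutor(max_workers=len(disks)) as pool:
        # pool.map yields results in input order, so the chunks join correctly.
        return b"".join(pool.map(read_chunk, range(num_chunks)))


data = b"hello colossus chunks"
chunks = split_into_chunks(data)
disks = place_chunks(chunks, num_disks=3)
restored = read_file_parallel(disks, len(chunks))
```

With real disks, each `read_chunk` call would wait on a different device, which is where the extra effective bandwidth comes from.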
Splitting the data into multiple partitions that can be read in parallel is a
powerful way to make reads fast, but it isn't sufficient for the performance
required by Dremel. There are a lot of reasons that reading a particular
chunk could be slow; the machine serving it could be overloaded, it could
have crashed, there could be network congestion, or the disk could be going
bad. Although the probability of each of these problems is small, when you
read from thousands of disks, the chance that at least one has a problem
becomes much higher.
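A quick calculation makes this concrete. The sketch below assumes that each disk is slow independently and with the same probability, which is a simplification; the value 0.001 is an illustrative figure and not a measured one.

```python
def prob_at_least_one_slow(p_slow, num_disks):
    """Probability that at least one of num_disks independent reads is slow."""
    # The complement of "every read is fast" is "at least one read is slow".
    return 1.0 - (1.0 - p_slow) ** num_disks


# A single read that is slow only 0.1 percent of the time looks harmless...
single = prob_at_least_one_slow(0.001, 1)
# ...but across 1,000 disks, the chance of hitting a laggard is about 63 percent.
many = prob_at_least_one_slow(0.001, 1000)
```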
The term for having a laggard or two among a lot of samples is tail
latency. A query is only as fast as the slowest disk; a problem in the “tail” of
the latency distribution can significantly affect query performance. One way
that Colossus handles tail latency is via replication. That is, multiple copies
of the same data are stored in different locations. So if one chunk server is
slow, the data can be fetched from somewhere else.
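As a rough sketch of how replication helps, the reader can skip a replica that is too slow and take the data from another one. The `Replica` class, the simulated latencies, and the deadline-based policy below are assumptions made for illustration; the text does not describe how Colossus actually chooses among replicas.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    latency_ms: float  # simulated response time, not a real measurement
    data: bytes


def read_with_fallback(replicas, deadline_ms):
    """Return (name, data) from the first replica that meets the deadline."""
    for replica in replicas:
        if replica.latency_ms <= deadline_ms:
            return replica.name, replica.data
        # This replica is too slow (overloaded, crashed, congested); try the next.
    raise TimeoutError("no replica answered within the deadline")


replicas = [
    Replica("server-a", latency_ms=900.0, data=b"chunk-7"),  # the laggard
    Replica("server-b", latency_ms=12.0, data=b"chunk-7"),
    Replica("server-c", latency_ms=15.0, data=b"chunk-7"),
]
source, payload = read_with_fallback(replicas, deadline_ms=100.0)
```

Because every replica holds the same bytes, the result is identical no matter which server ends up answering.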
The published Dremel paper mentioned another way of limiting tail
latency—ignoring data that takes too long to read, as long as enough of it can
be read. Internal Google services use this option, usually requiring only 98
percent of the data to be read before calling a query successful. BigQuery,
however, does not use this option; for a BigQuery query to succeed, every
last byte of data must be read.
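The difference between the two policies comes down to a single threshold. The function below is a hypothetical illustration, not Dremel's or BigQuery's actual interface; the name `query_succeeds` and its parameters are assumptions.

```python
def query_succeeds(bytes_read, bytes_total, min_fraction):
    """True if enough of the data was read for the query to count as successful."""
    return bytes_read >= min_fraction * bytes_total


# Suppose 985 of 1,000 bytes arrived before the slow readers were abandoned.
relaxed = query_succeeds(985, 1000, min_fraction=0.98)  # 98 percent policy
strict = query_succeeds(985, 1000, min_fraction=1.0)    # every-byte policy
```

Under the 98 percent policy the query above succeeds; under the every-byte policy it has to keep waiting or fail.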
ColumnIO
ColumnIO is the primary file format used to store data in BigQuery. In a
traditional database, data is laid out on disk in order to ensure access is as
fast as possible for typical workloads. The ColumnIO data format is laid out
to ensure fast access for Dremel workloads. Traditional databases rely on
indexes so that they can skip to the data they need for the query. Dremel