The technology described in this chapter is not necessarily what will be
running when you use BigQuery. The individual components, from the
ColumnIO storage format to the Colossus File System to the Dremel servers,
are all undergoing constant improvement and innovation. Descriptions of
some components in this chapter are simplified to avoid disclosing
confidential information. The important part is that the
high-level concepts are likely to remain the same in the future, even if the
underlying technology stack changes over time.
Storage Architecture
The most expensive part of any operation over Big Data is almost always
I/O. As previously mentioned, the disk I/O needed to read a 1 TB table will
take hours. If your goal is to interactively query a 1 TB table, you need to
figure out ways to bring the time you spend reading data down by 5 orders
of magnitude.
There are two technologies that Dremel uses to achieve (and at times far
surpass) the 1 TB per second goal. The first is called Colossus: a large,
parallel, distributed filesystem, developed at Google as a successor to the
Google File System (GFS). The second is the storage format, called
ColumnIO, which arranges the data in a manner that makes it easier to
query.
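
To make the 1 TB per second goal concrete, here is a rough back-of-the-envelope
sketch in Python. It assumes a sequential read rate of 100 MB per second per
disk and perfectly even striping of the table across disks. Both are
illustrative assumptions rather than figures from this chapter, and the
function names are hypothetical.

```python
import math

TB = 10**12  # bytes in a (decimal) terabyte
MB = 10**6   # bytes in a (decimal) megabyte

# Assumption for illustration only: sequential throughput of one commodity disk.
ASSUMED_DISK_RATE = 100 * MB  # bytes per second


def scan_seconds(table_bytes, disk_bytes_per_sec, num_disks=1):
    """Seconds to read a table striped evenly across num_disks read in parallel."""
    return table_bytes / (disk_bytes_per_sec * num_disks)


def disks_needed(table_bytes, disk_bytes_per_sec, target_seconds):
    """Smallest number of parallel disks that meets the target scan time."""
    return math.ceil(table_bytes / (disk_bytes_per_sec * target_seconds))


single_disk = scan_seconds(TB, ASSUMED_DISK_RATE)
print(f"one disk: {single_disk:.0f} s ({single_disk / 3600:.1f} hours)")
print(f"disks for a 1 s scan: {disks_needed(TB, ASSUMED_DISK_RATE, 1.0)}")
```

Under these assumptions a single disk needs hours, and only a very large
number of disks reading in parallel gets close to one second. That is the
kind of parallelism a distributed filesystem has to provide; reading less
data per query, which is where the storage format comes in, reduces the
requirement further.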
Colossus File System (CFS)
Although Google described the architecture of its predecessor, GFS, in a
public research paper
(http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf),
it has kept Colossus largely under wraps. Details about Colossus are
generally confidential; it is a refinement of GFS that fixes a number of
scalability problems. For now, just focus on the features of CFS that enable
Dremel's super-fast query performance, which, for the most part, are the
same as those of GFS (or the open source clone, HDFS).
Colossus is a distributed filesystem, which means that the storage is not
physically attached to the machines requesting the data, and that data is
distributed across the network. All of the data in Colossus is stored on
commodity disks. Expensive storage hardware solutions can be fast, but
they are a single point of failure and often don't scale well. Storing the data