takes a more brute-force approach, reading every single row for each query.
While a database such as MySQL can skip rows it doesn't need, Dremel takes
an alternative approach; it can avoid reading columns it doesn't need.
Traditional databases store data in row order. That is, they store all the
fields in the first row, then all the fields in the second row, and so on.
ColumnIO stores the data in column order. Each column gets its own file.
To read from multiple columns at once, you need to open all the files you
need and iterate through each one in parallel. This read operation must be
synchronized, but because each column is coming from a different chunk
server, the I/O requests can all be performed in parallel.
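
The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table, its field names, and the read_columns helper are hypothetical, and in-memory lists stand in for the per-column files. It is not how ColumnIO is actually implemented.

```python
# Hypothetical three-row table; the field names are illustrative only.
rows = [
    {"name": "alice", "age": 30, "city": "Seattle"},
    {"name": "bob", "age": 25, "city": "Austin"},
    {"name": "carol", "age": 41, "city": "Boston"},
]

# Row order: all fields of the first row, then all fields of the second row.
row_store = [value for row in rows for value in row.values()]

# Column order: one sequence per column, standing in for one file per column.
column_store = {field: [row[field] for row in rows] for field in rows[0]}


def read_columns(store, fields):
    """Walk the requested columns in lockstep, producing one tuple per row."""
    return list(zip(*(store[field] for field in fields)))


# Only the "name" and "city" sequences are touched; "age" is never read.
selected = read_columns(column_store, ["name", "city"])
print(selected)
```

The zip call plays the role of the synchronized read: each column is advanced one value at a time so that the values at the same position are reassembled into a row.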
Note that on a traditional storage system, reading from column-based
storage would likely be slow because the disk would have to seek constantly
between the column files instead of just reading sequentially. Because
BigQuery stores the data in CFS, however, each column is going to come
from a different disk in the storage cluster. This means that reading from
multiple files at once will not involve any additional seek operations.
Figure 9.1 shows a columnar layout and compares it to a record-based one.
Nested fields are treated as completely separate fields. Repeated fields are
packed within the parent field, with a special marker that indicates the start
of the next row. This makes seeking in a repeated field somewhat more
expensive than seeking in a singular field because you have to scan through
all the repeated values to get to the next row.
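
To see why seeking in a repeated field costs more, consider the following simplified sketch. It assumes a hypothetical ROW_END sentinel written after each row's values (equivalent to marking where the next row starts); the real on-disk encoding is not described here and may differ. The names pack_repeated and read_row are likewise invented for illustration.

```python
ROW_END = object()  # hypothetical marker separating one row's values from the next


def pack_repeated(rows):
    """Flatten per-row lists of repeated values into one stream with markers."""
    stream = []
    for values in rows:
        stream.extend(values)
        stream.append(ROW_END)
    return stream


def read_row(stream, row_index):
    """Return the values of row_index by scanning the stream from the start."""
    current = 0
    values = []
    for item in stream:
        if item is ROW_END:
            if current == row_index:
                return values
            current += 1
            values = []
        elif current == row_index:
            values.append(item)
    raise IndexError(row_index)


# Three rows: two values, no values, three values.
packed = pack_repeated([["a", "b"], [], ["c", "d", "e"]])
third_row = read_row(packed, 2)
print(third_row)
```

Because rows can hold any number of values, there is no fixed offset to jump to. Reaching the third row means stepping past every value and marker that belongs to the rows before it.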
Figure 9.1 Record-oriented versus column-oriented storage
Two factors make reading from column-oriented storage faster than reading
from record-oriented storage: selectivity and compression. Selectivity
is the ability to select only the columns needed in the query. Many tables
have a wide schema, but most queries just reference a few fields. The ability
to read only the columns needed by the query is a key feature of ColumnIO,
and it can often reduce the amount of data that must be read by an order of
magnitude or more.
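
A rough back-of-the-envelope calculation shows where the savings come from. The numbers below are made up purely for illustration (a hypothetical 50-column table with equally sized columns and a query that references three of them), and bytes_scanned is a hypothetical helper, not a BigQuery API.

```python
# Hypothetical column sizes in bytes for a wide, 50-column table.
column_bytes = {f"col_{i}": 1_000_000 for i in range(50)}


def bytes_scanned(column_sizes, referenced):
    """Sum the sizes of only the columns a query references."""
    return sum(column_sizes[name] for name in referenced)


# A query that touches three fields reads three column files.
query_columns = ["col_3", "col_17", "col_42"]
columnar = bytes_scanned(column_bytes, query_columns)

# A record-oriented scan reads every field of every row.
row_oriented = bytes_scanned(column_bytes, column_bytes)

reduction = row_oriented / columnar
print(f"columnar: {columnar} bytes, row-oriented: {row_oriented} bytes")
print(f"reduction: about {reduction:.1f}x")
```

Under these assumptions the columnar read touches 3 MB instead of 50 MB. Real tables have columns of very different sizes, so the actual ratio depends on which fields a query references.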
 
 