Understanding Query Execution - Google BigQuery Analytics

takes a more brute force approach, reading every single row for each query.

While a database such as MySQL can skip rows it doesn't need, Dremel takes

an alternative approach: it can avoid reading columns it doesn't need.

Traditional databases store data in row-order. That is, they store all the

fields in the first row, then all the fields in the second row, and so on.

ColumnIO stores the data in column-order. Each column gets its own file.

To read from multiple columns at once, you need to open all the files you

need and iterate through each one in parallel. This read operation must be

synchronized, but because each column is coming from a different chunk

server, the I/O requests can all be performed in parallel.

Note that in a traditional storage system, reading from column-based

storage will likely be slow because the disk has to seek back and forth

between the column files instead of just reading sequentially. Because

BigQuery stores the data in CFS, however, each column is going to come

from a different disk in the storage cluster. This means that reading from

multiple files at once will not involve any additional seek operations.
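
To make the difference between the two layouts concrete, here is a minimal Python sketch. It is not ColumnIO itself: the table, the field names, and the helper functions are all hypothetical, and in-memory lists stand in for the per-column files. It shows the same records laid out in row order and in column order, then reads two columns by iterating them in lockstep.

```python
# Hypothetical three-row table; the field names are illustrative only.
records = [
    {"name": "alice", "age": 30, "city": "Oslo"},
    {"name": "bob", "age": 25, "city": "Lima"},
    {"name": "carol", "age": 41, "city": "Pune"},
]


def to_row_order(records):
    """Row order: all the fields of row 1, then all the fields of row 2, ..."""
    return [value for rec in records for value in rec.values()]


def to_column_order(records):
    """Column order: one list per column, standing in for one file per column."""
    columns = {}
    for rec in records:
        for field, value in rec.items():
            columns.setdefault(field, []).append(value)
    return columns


def read_columns(columns, wanted):
    """Open only the wanted column 'files' and step through them together."""
    iterators = [iter(columns[name]) for name in wanted]
    return list(zip(*iterators))


row_store = to_row_order(records)
column_store = to_column_order(records)

# The "age" column is never touched by this read.
selected = read_columns(column_store, ["name", "city"])
```

In this sketch the lockstep iteration plays the role of the synchronized read: each iterator advances one value per row, so the values stay aligned even though each column lives in its own container.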

Figure 9.1 shows a columnar layout and compares it to a record-based one.

Nested fields are treated as completely separate fields. Repeated fields are

packed within the parent field, with a special marker that indicates the start

of the next row. This makes seeking in a repeated field somewhat more

expensive than seeking in a singular field because you have to scan through

all the repeated values to get to the next row.
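
The cost of seeking in a repeated field can be sketched in a few lines of Python. The text does not describe the actual on-disk encoding, so this is only an assumption-laden illustration: `NEXT_ROW` is a hypothetical sentinel standing in for the special marker, and `pack_repeated` and `read_row` are made-up helper names.

```python
NEXT_ROW = object()  # hypothetical sentinel: marks the start of the next row


def pack_repeated(rows):
    """Pack each row's repeated values into one flat list, separated by markers."""
    packed = []
    for i, values in enumerate(rows):
        if i > 0:
            packed.append(NEXT_ROW)
        packed.extend(values)
    return packed


def read_row(packed, row_index):
    """Return (values of the requested row, number of entries scanned)."""
    current = 0
    values = []
    scanned = 0
    for item in packed:
        scanned += 1
        if item is NEXT_ROW:
            if current == row_index:
                return values, scanned
            current += 1
        elif current == row_index:
            values.append(item)
    if current == row_index:
        return values, scanned
    raise IndexError("row index out of range")


# Three rows: three values, no values, one value.
packed = pack_repeated([["a", "b", "c"], [], ["d"]])
last_row, scanned = read_row(packed, 2)
```

Reaching the third row here means walking past every value and marker that precedes it, which is why a seek in a repeated field costs more than a seek in a singular field, where row `n` is simply the `n`th value.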

Figure 9.1 Record-oriented versus column-oriented storage

There are two factors that make reading from column-oriented storage

faster than record-oriented storage: selectivity and compression. Selectivity

is the ability to select only the columns needed in the query. Many tables

have a wide schema, but most queries just reference a few fields. The ability

to read only the columns needed by the query is a key feature of ColumnIO,

which can often reduce the amount of data that must be read by an order of

magnitude or more.
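
A quick back-of-the-envelope calculation shows where the order-of-magnitude figure can come from. The numbers below are invented for illustration (a 40-column table with equally sized columns and a query that references three of them); real tables will differ.

```python
def bytes_scanned(column_sizes, referenced):
    """Bytes read when only the referenced columns are opened."""
    return sum(column_sizes[name] for name in referenced)


# Hypothetical wide table: 40 columns of 1 MB each.
column_sizes = {f"col{i}": 1_000_000 for i in range(40)}

# A record-oriented scan has to read every field of every row.
row_store_bytes = sum(column_sizes.values())

# A column-oriented scan reads only the three columns the query references.
column_store_bytes = bytes_scanned(column_sizes, ["col0", "col7", "col12"])

reduction = row_store_bytes / column_store_bytes
```

Under these assumptions the columnar read touches 3 MB instead of 40 MB, a reduction of a little more than 13 times.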
