Understanding Query Execution - Google BigQuery Analytics

Each column in ColumnIO is compressed separately. Dremel reads back the

compressed data and decompresses it on the fly as needed. Because I/O

bandwidth is by far the biggest bottleneck in the system, you can improve

your overall throughput by a factor of 10 or more by operating over

compressed data.

At first, it isn't apparent why a column store would compress much better

than a record store; after all, they're both storing the same data. You're probably

used to seeing data compress at 2 or 3 to 1 when you compress a file, so 10x

sounds unreasonable.

To see why column stores compress so well, think about what a compression

algorithm does: It searches for redundant information and re-encodes it in

a smaller way. For example, if you have the string QQQrrQQQ, you could

compress it to aba and store the mapping of a to QQQ and b to rr. Although

this is a contrived example, it is how most compression algorithms work,

in some form. First, the algorithm scans the input and looks for repeated

strings. Then it saves the repeated data in a dictionary. Finally, it

replaces the redundant input with a shorter encoding.
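The QQQrrQQQ example can be sketched in a few lines of Python. This is only a toy illustration: the dictionary is written by hand rather than discovered by scanning the input, and the names compress and decompress are hypothetical helpers, not part of Dremel or any real compression library.

```python
def compress(text, dictionary):
    """Replace each dictionary phrase in text with its one-letter code."""
    # The dictionary maps code -> phrase; invert it for encoding.
    phrase_to_code = {phrase: code for code, phrase in dictionary.items()}
    out = []
    i = 0
    while i < len(text):
        for phrase, code in phrase_to_code.items():
            if text.startswith(phrase, i):
                out.append(code)
                i += len(phrase)
                break
        else:
            # This toy scheme cannot encode text outside the dictionary.
            raise ValueError(f"no dictionary entry matches at position {i}")
    return "".join(out)


def decompress(encoded, dictionary):
    """Expand each code back into the phrase it stands for."""
    return "".join(dictionary[code] for code in encoded)


# Hand-written dictionary from the example in the text.
dictionary = {"a": "QQQ", "b": "rr"}

encoded = compress("QQQrrQQQ", dictionary)
print(encoded)                          # aba
print(decompress(encoded, dictionary))  # QQQrrQQQ
```

Eight characters shrink to three, plus the cost of storing the dictionary. On input this short the dictionary outweighs the savings, which is why the approach only pays off when the same strings repeat many times.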

Now think back to the ColumnIO disk format, which has one file per column.

What do the individual fields look like? They're not usually random text.

They often fall into one of the following categories:

• IDs: These could be a customer ID or an e-mail address. They usually

have a decent amount of redundancy (or to be technical, low entropy).

An e-mail address, for example, probably ends in “.com.” In addition,

there are probably a lot of e-mail addresses from the same domain, like

“user@hotmail.com.”

• Small numbers: Most numbers are small and don't use all their

allotted 8 bytes. For example, an age column will be unlikely to have

more than 100 distinct values. Even when they're larger, numbers often

follow patterns; numbers in nature have far more leading 1s than leading

9s, whereas prices have a lot more 9s than anything else.

• Enumerated data: Enumerations are the kings of redundancy. Often

in a database table, you have a column for Item Code or Language.

These fields may have only a few distinct values but take up a lot of

space when those values are written out in full for each row.
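To make the enumerated-data case concrete, here is a minimal sketch of dictionary encoding applied to a single column. The Language values, the row count, and the one-byte-per-code size estimate are all assumptions chosen for illustration; this is not a description of how ColumnIO actually lays out its files.

```python
def dictionary_encode(values):
    """Return (dictionary, codes), storing each distinct value only once."""
    dictionary = []   # code -> value
    code_of = {}      # value -> code
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


# Hypothetical Language column: 3 distinct values across 9,000 rows.
language_column = ["English", "Spanish", "Japanese"] * 3000

dictionary, codes = dictionary_encode(language_column)

# Size of the column with every value written out in full.
raw_bytes = sum(len(value.encode("utf-8")) for value in language_column)

# Assumption: one byte per code (enough for up to 256 distinct values),
# plus the bytes needed to store the dictionary itself.
encoded_bytes = len(codes) + sum(len(v.encode("utf-8")) for v in dictionary)

print(dictionary)                # ['English', 'Spanish', 'Japanese']
print(raw_bytes, encoded_bytes)  # 66000 9022
```

Under these assumptions the column shrinks by roughly a factor of seven before any general-purpose compressor runs. With only three distinct values, two bits per code would be enough, so a tighter packing could do better still.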
