Each column in ColumnIO is compressed separately. Dremel reads back the
compressed data and decompresses it on the fly as needed. Because I/O
bandwidth is by far the biggest bottleneck in the system, reading
compressed data instead of raw data can improve overall throughput by a
factor of 10 or more.
At first, it isn't apparent why a column store would compress much better
than a record store. After all, they're both storing the same data. You're
probably used to seeing files compress at 2 or 3 to 1, so 10x sounds
unreasonable.
To see why column stores compress so well, think about what a compression
algorithm does: it searches for redundant information and re-encodes it in
a smaller form. For example, if you have the string QQQrrQQQ, you could
compress it to aba and store the mapping of a to QQQ and b to rr. Although
this is a contrived example, it is how most compression algorithms work,
in some form. First, the algorithm scans the input and looks for repeated
strings. Then it saves the repeated data in a dictionary. Finally, it
replaces the redundant input with shorter codes that refer to the
dictionary.
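The QQQrrQQQ example can be sketched in a few lines of Python. This is only a toy illustration, not how any real compressor is implemented: the dictionary is supplied by hand rather than discovered by scanning, and the function names are hypothetical. It also assumes the code letters (a, b) never occur in the input itself.

```python
def dictionary_compress(text, mapping):
    """Replace each dictionary phrase in text with its one-letter code."""
    encoded = text
    for code, phrase in mapping.items():
        encoded = encoded.replace(phrase, code)
    return encoded


def dictionary_decompress(encoded, mapping):
    """Expand every code letter back into its phrase.

    Assumes the encoded string consists only of code letters.
    """
    return "".join(mapping[code] for code in encoded)


# Hand-written dictionary from the example: a -> QQQ, b -> rr.
mapping = {"a": "QQQ", "b": "rr"}

encoded = dictionary_compress("QQQrrQQQ", mapping)
restored = dictionary_decompress(encoded, mapping)
print(encoded, restored)
```

A real algorithm also has to store the dictionary alongside the output, so the savings only show up when the repeated strings occur often enough to pay for it.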
Now think back to the ColumnIO disk format, which has one file per column.
What do the individual fields look like? They're not usually random text.
They often fall into one of the following categories:
IDs: These could be a customer ID or an e-mail address. They usually
have a decent amount of redundancy (or, to be technical, low entropy).
An e-mail address, for example, probably ends in “.com”. In addition,
there are probably a lot of e-mail addresses from the same domain, like
user@hotmail.com.
Small numbers: Most numbers are small and don't use all their
allotted 8 bytes. For example, an age column is unlikely to have
more than 100 distinct values. Even when they're larger, numbers often
follow patterns; numbers in nature have a leading digit of 1 far more
often than 9 (Benford's law), whereas prices have a lot more 9s than
anything else.
Enumerated data: Enumerations are the kings of redundancy. Often
in a database table, you have a column for Item Code or Language.
These fields may have only a few distinct values but can take up a lot of
space if those values are written out in full for each row.
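To get a feel for the effect of these categories, the following sketch compresses the same synthetic table twice with Python's standard zlib module: once as a single record-oriented blob and once as one blob per column. The data is deliberately contrived (sequential IDs on one domain, a repeating cycle of ages, a three-value language column), and the text-based layout is an assumption for illustration, not the actual ColumnIO format. Real ratios depend heavily on the data and the compressor.

```python
import zlib

LANGUAGES = ["en", "es", "fr"]  # hypothetical enumerated column


def make_rows(count=2000):
    """Build synthetic (email, age, language) rows with low-entropy columns."""
    rows = []
    for i in range(1000, 1000 + count):
        email = f"user{i}@hotmail.com"            # IDs sharing one domain
        age = 18 + (i * 7) % 60                   # small numbers, 18..77
        language = LANGUAGES[i % len(LANGUAGES)]  # enumerated values
        rows.append((email, str(age), language))
    return rows


def row_blob(rows):
    """Record layout: all fields of a row stored next to each other."""
    return "\n".join(",".join(row) for row in rows).encode("utf-8")


def column_blobs(rows):
    """Column layout: one blob per column, each compressed on its own."""
    return ["\n".join(column).encode("utf-8") for column in zip(*rows)]


rows = make_rows()
raw_size = len(row_blob(rows))
row_size = len(zlib.compress(row_blob(rows)))
column_size = sum(len(zlib.compress(blob)) for blob in column_blobs(rows))

print("raw bytes:          ", raw_size)
print("row store, zlib:    ", row_size)
print("column store, zlib: ", column_size)
```

On this data the per-column total comes out smaller than the record-oriented blob, because each column's pattern is uninterrupted by values from the other columns.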