a block index, which is stored after all data blocks, records the offset of each block and is used to locate a specific block. To achieve storage efficiency, Llama applies block-level compression, using any of the well-known compression schemes.
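The layout described above can be sketched in a few lines. The byte layout, fixed-width offsets, and use of zlib below are illustrative assumptions, not Llama's actual on-disk format:

```python
import io
import zlib

def write_blocks(records, block_size=2):
    """Write groups of records as compressed blocks, followed by a
    block index that stores the start offset of each block.
    Simplified illustration, not Llama's real format."""
    buf = io.BytesIO()
    offsets = []
    for i in range(0, len(records), block_size):
        offsets.append(buf.tell())              # remember where this block starts
        payload = "\n".join(records[i:i + block_size]).encode()
        buf.write(zlib.compress(payload))       # block-level compression
    for off in offsets:                         # block index after all data blocks
        buf.write(off.to_bytes(8, "big"))
    buf.write(len(offsets).to_bytes(4, "big"))  # footer: number of index entries
    return buf.getvalue()

def read_block(data, block_no):
    """Locate one block via the trailing index and decompress only it."""
    n = int.from_bytes(data[-4:], "big")
    index_start = len(data) - 4 - 8 * n
    offsets = [int.from_bytes(data[index_start + 8 * i:index_start + 8 * i + 8], "big")
               for i in range(n)]
    end = offsets[block_no + 1] if block_no + 1 < n else index_start
    return zlib.decompress(data[offsets[block_no]:end]).decode().split("\n")
```

Because the index is read first, a reader can seek directly to any block and decompress only that block rather than scanning the whole file.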
To improve query processing and the performance of join operations, Llama organizes columns into correlation groups that provide the basis for the vertical partitioning of tables. In particular, it creates multiple vertical groups, where each group is defined by a collection of columns, one of which is designated as the sorting column. Initially, when a new table is imported into the system, a basic vertical group is created that contains all the columns of the table, sorted by the table's primary key by default. In addition, based on statistics of query patterns, auxiliary groups are dynamically created or discarded to improve query performance. The Clydesdale system [73, 157], which has been designed for workloads where the data fits a star schema, uses CFile for storing its fact tables. It also relies on tailored join plans and a block iteration mechanism [243] to optimize the execution of its target workloads.
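As a rough illustration of vertical groups, the following sketch builds a basic group containing all columns sorted by the primary key, and an auxiliary group covering a subset of columns sorted by a different column. The table contents and helper name are hypothetical; this is not Llama's API:

```python
def make_group(table, columns, sort_column):
    """Project a column-oriented table onto `columns`, with all columns
    co-sorted by `sort_column` (the group's sorting column)."""
    rows = sorted(zip(*(table[c] for c in columns)),
                  key=lambda r: r[columns.index(sort_column)])
    return {c: [r[i] for r in rows] for i, c in enumerate(columns)}

table = {"id": [3, 1, 2], "price": [30.0, 10.0, 20.0], "qty": [5, 7, 6]}

# Basic vertical group: all columns, sorted by the primary key "id".
basic = make_group(table, ["id", "price", "qty"], "id")

# Auxiliary group, created (hypothetically) because query statistics
# show frequent scans ordered by "price".
aux = make_group(table, ["price", "qty"], "price")
```

A scan that filters on `price` can then read the auxiliary group, which is already sorted on that column, instead of the full basic group.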
RCFile [146] (Record Columnar File) is another data placement structure that provides column-wise storage for the Hadoop Distributed File System (HDFS). In RCFile, each table is first horizontally partitioned into multiple row groups, and each row group is then vertically partitioned so that each column is stored independently (Fig. 9.8). In particular, each table can span multiple HDFS blocks, where each block organizes records with the row group as its basic unit. Depending on the row group size and the HDFS block size, an HDFS block can hold one or multiple row groups. A row group contains the following three sections:
1. The sync marker, which is placed at the beginning of the row group and is mainly used to separate two contiguous row groups in an HDFS block.
2. A metadata header, which records how many records are in the row group, how many bytes are in each column, and how many bytes are in each field of a column.
3. The table data section, which is effectively a column store where all the fields of the same column are stored contiguously.
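The three sections can be illustrated with a minimal writer and a single-column reader. The sync-marker bytes, JSON header, and field delimiter below are simplifying assumptions; RCFile's real header is RLE-compressed binary metadata:

```python
import gzip
import json

SYNC = b"\x00SYNCMRK"  # 8-byte placeholder sync marker (real markers differ)

def write_row_group(columns):
    """Serialize one row group: sync marker, metadata header, then the
    table data section with each column gzip-compressed independently.
    (Header compression via RLE is omitted for brevity.)"""
    data_parts = []
    meta = {"records": 0, "columns": []}
    for name, values in columns.items():
        fields = [str(v).encode() for v in values]
        meta["records"] = len(fields)
        packed = gzip.compress(b"\x01".join(fields))  # column-wise compression
        meta["columns"].append({"name": name,
                                "bytes": len(packed),
                                "field_bytes": [len(f) for f in fields]})
        data_parts.append(packed)
    header = json.dumps(meta).encode()
    return SYNC + len(header).to_bytes(4, "big") + header + b"".join(data_parts)

def read_column(row_group, wanted):
    """Read only the requested column, skipping the bytes of all others."""
    hlen = int.from_bytes(row_group[8:12], "big")
    meta = json.loads(row_group[12:12 + hlen])
    pos = 12 + hlen
    for col in meta["columns"]:
        if col["name"] == wanted:
            return gzip.decompress(row_group[pos:pos + col["bytes"]]).split(b"\x01")
        pos += col["bytes"]  # skip a column the query does not need
```

Because the header records each column's compressed size, the reader can jump over unneeded columns without decompressing them, which is the source of the I/O savings described below.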
RCFile applies column-wise data compression within each row group and provides a lazy decompression technique to avoid unnecessary column decompression during query execution. In particular, the metadata header section is compressed using the RLE (Run-Length Encoding) algorithm. The table data section is not compressed as a whole unit; instead, each column is independently compressed with the Gzip compression algorithm. When processing a row group, RCFile does not need to read the whole content of the row group into memory. It reads only the metadata header and the columns needed for a given query, and thus it can skip unnecessary columns and gain the I/O advantages of a column store. The metadata header is always decompressed and held in memory until RCFile processes the next row group. However, RCFile does not eagerly decompress all the loaded columns; with its lazy decompression technique, a column is not decompressed in memory until RCFile has determined that the data in the column will actually be useful for query execution.
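The lazy decompression behavior can be mimicked with a small wrapper that defers gzip decompression until a column's values are first requested. The class, field names, and comma-delimited encoding are illustrative assumptions, not RCFile's implementation:

```python
import gzip

class LazyColumn:
    """Holds a gzip-compressed column and decompresses it only on first
    access, in the spirit of RCFile's lazy decompression."""
    def __init__(self, compressed):
        self._compressed = compressed
        self._values = None
        self.decompressed = False  # observable flag for illustration

    def values(self):
        if self._values is None:   # decompress only on first real use
            self._values = gzip.decompress(self._compressed).split(b",")
            self.decompressed = True
        return self._values

# Two columns loaded (still compressed) from a row group.
raw = {"id": b"1,2,3", "price": b"10,20,30"}
cols = {name: LazyColumn(gzip.compress(data)) for name, data in raw.items()}

# A predicate touches only "price"; "id" stays compressed in memory.
matches = [v for v in cols["price"].values() if int(v) > 15]
```

If the predicate on `price` had matched no records, a real query engine could discard the row group without ever paying the decompression cost for the remaining columns.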