a block index, which is stored after all data blocks, records the offset of each block and is used to locate a specific block. To achieve storage efficiency, Llama applies block-level compression, using any of the well-known compression schemes.
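The layout described above can be sketched in a few lines. The byte layout, fixed-width offsets, and use of zlib below are illustrative assumptions, not Llama's actual on-disk format:

```python
import io
import zlib

def write_blocks(records, block_size=2):
    """Write groups of records as compressed blocks, followed by a
    block index that stores the start offset of each block.
    Simplified illustration, not Llama's real format."""
    buf = io.BytesIO()
    offsets = []
    for i in range(0, len(records), block_size):
        offsets.append(buf.tell())              # remember where this block starts
        payload = "\n".join(records[i:i + block_size]).encode()
        buf.write(zlib.compress(payload))       # block-level compression
    for off in offsets:                         # block index after all data blocks
        buf.write(off.to_bytes(8, "big"))
    buf.write(len(offsets).to_bytes(4, "big"))  # footer: number of index entries
    return buf.getvalue()

def read_block(data, block_no):
    """Locate one block via the trailing index and decompress only it."""
    n = int.from_bytes(data[-4:], "big")
    index_start = len(data) - 4 - 8 * n
    offsets = [int.from_bytes(data[index_start + 8 * i:index_start + 8 * i + 8], "big")
               for i in range(n)]
    end = offsets[block_no + 1] if block_no + 1 < n else index_start
    return zlib.decompress(data[offsets[block_no]:end]).decode().split("\n")
```

Because the index is read first, a reader can seek directly to any block and decompress only that block rather than scanning the whole file.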
To improve query processing and the performance of join operations, Llama organizes columns into correlation groups that provide the basis for the vertical partitioning of tables. In particular, it creates multiple vertical groups, where each group is defined by a collection of columns, one of which is designated as the sorting column. Initially, when a new table is imported into the system, a basic vertical group is created that contains all the columns of the table, sorted by the table's primary key by default. In addition, based on statistics of query patterns, auxiliary groups are dynamically created or discarded to improve query performance. The Clydesdale system [73, 157], which has been designed for workloads where the data fits a star schema, uses CFile for storing its fact tables. It also relies on tailored join plans and a block iteration mechanism [243] to optimize the execution of its target workloads.
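As a rough illustration of vertical groups, the following sketch builds a basic group containing all columns sorted by the primary key, and an auxiliary group covering a subset of columns sorted by a different column. The table contents and helper name are hypothetical; this is not Llama's API:

```python
def make_group(table, columns, sort_column):
    """Project a column-oriented table onto `columns`, with all columns
    co-sorted by `sort_column` (the group's sorting column)."""
    rows = sorted(zip(*(table[c] for c in columns)),
                  key=lambda r: r[columns.index(sort_column)])
    return {c: [r[i] for r in rows] for i, c in enumerate(columns)}

table = {"id": [3, 1, 2], "price": [30.0, 10.0, 20.0], "qty": [5, 7, 6]}

# Basic vertical group: all columns, sorted by the primary key "id".
basic = make_group(table, ["id", "price", "qty"], "id")

# Auxiliary group, created (hypothetically) because query statistics
# show frequent scans ordered by "price".
aux = make_group(table, ["price", "qty"], "price")
```

A scan that filters on `price` can then read the auxiliary group, which is already sorted on that column, instead of the full basic group.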
RCFile [146] (Record Columnar File) is another data placement structure that provides column-wise storage for the Hadoop Distributed File System (HDFS). In RCFile, each table is first horizontally partitioned into multiple row groups, and each row group is then vertically partitioned so that each column is stored independently (Fig. 9.8). In particular, each table can span multiple HDFS blocks, where each block organizes records with the row group as its basic unit. Depending on the row group size and the HDFS block size, an HDFS block can hold one or multiple row groups. A row group contains the following three sections:
1. The sync marker, which is placed at the beginning of the row group and is mainly used to separate two contiguous row groups in an HDFS block.
2. A metadata header, which records how many records are in the row group, how many bytes are in each column, and how many bytes are in each field of a column.
3. The table data section, which is effectively a column store where all the fields of the same column are stored contiguously.
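The three sections can be illustrated with a minimal writer and a single-column reader. The sync-marker bytes, JSON header, and field delimiter below are simplifying assumptions; RCFile's real header is RLE-compressed binary metadata:

```python
import gzip
import json

SYNC = b"\x00SYNCMRK"  # 8-byte placeholder sync marker (real markers differ)

def write_row_group(columns):
    """Serialize one row group: sync marker, metadata header, then the
    table data section with each column gzip-compressed independently.
    (Header compression via RLE is omitted for brevity.)"""
    data_parts = []
    meta = {"records": 0, "columns": []}
    for name, values in columns.items():
        fields = [str(v).encode() for v in values]
        meta["records"] = len(fields)
        packed = gzip.compress(b"\x01".join(fields))  # column-wise compression
        meta["columns"].append({"name": name,
                                "bytes": len(packed),
                                "field_bytes": [len(f) for f in fields]})
        data_parts.append(packed)
    header = json.dumps(meta).encode()
    return SYNC + len(header).to_bytes(4, "big") + header + b"".join(data_parts)

def read_column(row_group, wanted):
    """Read only the requested column, skipping the bytes of all others."""
    hlen = int.from_bytes(row_group[8:12], "big")
    meta = json.loads(row_group[12:12 + hlen])
    pos = 12 + hlen
    for col in meta["columns"]:
        if col["name"] == wanted:
            return gzip.decompress(row_group[pos:pos + col["bytes"]]).split(b"\x01")
        pos += col["bytes"]  # skip a column the query does not need
```

Because the header records each column's compressed size, the reader can jump over unneeded columns without decompressing them, which is the source of the I/O savings described below.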
RCFile applies column-wise data compression within each row group and provides a lazy decompression technique to avoid unnecessary column decompression during query execution. In particular, the metadata header section is compressed using the RLE (Run-Length Encoding) algorithm. The table data section is not compressed as a whole unit; instead, each column is independently compressed with the Gzip compression algorithm. When processing a row group, RCFile does not need to read the whole content of the row group into memory. It reads only the metadata header and the columns needed for a given query, and thus it can skip unnecessary columns and gain the I/O advantages of a column store. The metadata header is always decompressed and held in memory until RCFile processes the next row group. However, RCFile does not eagerly decompress all the loaded columns; with its lazy decompression technique, a column is not decompressed in memory until RCFile has determined that the data in the column will actually be useful for query execution.
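The lazy decompression behavior can be mimicked with a small wrapper that defers gzip decompression until a column's values are first requested. The class, field names, and comma-delimited encoding are illustrative assumptions, not RCFile's implementation:

```python
import gzip

class LazyColumn:
    """Holds a gzip-compressed column and decompresses it only on first
    access, in the spirit of RCFile's lazy decompression."""
    def __init__(self, compressed):
        self._compressed = compressed
        self._values = None
        self.decompressed = False  # observable flag for illustration

    def values(self):
        if self._values is None:   # decompress only on first real use
            self._values = gzip.decompress(self._compressed).split(b",")
            self.decompressed = True
        return self._values

# Two columns loaded (still compressed) from a row group.
raw = {"id": b"1,2,3", "price": b"10,20,30"}
cols = {name: LazyColumn(gzip.compress(data)) for name, data in raw.items()}

# A predicate touches only "price"; "id" stays compressed in memory.
matches = [v for v in cols["price"].values() if int(v) > 15]
```

If the predicate on `price` had matched no records, a real query engine could discard the row group without ever paying the decompression cost for the remaining columns.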