Compression Schemes of High Dimensional Data for MOLAP - Evolving Application Domains of Data Warehousing and Mining

Database Reference

In-Depth Information

run length encoding

PV=(y 1 ,y 2 ,...,y n ) where y i are in DB and

y i ≠c.

The y i are arranged according to their logical

order in DB. No compression algorithm is ap-

plied on PV because it stores only non-constants

values. The Bit Vector BV indicates the locations

of constants and non-constants in the database.

The bit vector is

BV=(b 1 ,b 2 ,…,bn) where b i =1 if xi i ≠c and b i =0

if x i =c for 1≤ i ≤ N.

where BV consists of N bits.

The Address Vector AV is typically small and

is used as an index for searching the database. It

is stored in main memory rather than secondary

storage. In addition to efficient compression fast

forward and backward mapping between logical

and physical databases is also important. To do

this, BV is divided into subvectors of D bits each.

The subvectors are compressed independently.

This division of BV into subvectors makes the

Address Vector AV sufficiently small to store it

in main memory.

BV can be compressed by run-length encod-

ing method (also discussed in this chapter). The

division of BV into subvectors imposes a division

of the database DB into d= ⌈ N/D ⌉ sections, each

consisting of D elements. The address vector is

defined as:

AV=(a 1 ,a 2 ,a 3 ,…a d )

Where a 1 =0 and for i ³ , a i is the relative

position in PV of the last non-constant element

in the if th section of DB if such a non-constant

exists, otherwise we set a i =a i-1 .

Consider an example where DB=(1, 0, 0, 7, 0,

0, 0, 0, 0, 0, 0, 0, 18, 0, 0, 0, 13, 0, 0, 0, 0, 9, 0, 0,

37) and constants are 0 and D=5, Then

BV=(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,

1, 0, 0, 0, 0, 1, 0, 0, 1)

PV=(1, 7, 18, 13, 9, 37)

AV=(0, 2, 2, 3, 4)

Run Length Encoding (RLE) is based on the idea

to replace a long sequence of the same symbol by

a shorter sequence in a linearized array. RLE is a

very simple form of data compression in which runs

of data (that is, sequences in which the same data

value occurs in many consecutive data elements)

are stored as a single data value and count, rather

than as the original run. This is most useful on

data that contains many such runs.The sequence

of length l of a repeated symbol ' s ' is replaced by

a shorter sequence, usually containing one or more

symbols of ' s ', a length information and sometimes

an escape symbol. RLE replaces a string of repeated

symbols with a single symbol and a count ( run

length ) indicating the number of times the symbol is

repeated. For example the string “aaaabbcdeeeeef-

ghhhij” is replaced with “a4b2c1d1e5f1g1h3i1j1”.

The numbers indicate that they are values, not

symbols. One of the major disadvantages of the

classical run-length encoding is that they cannot

support updates to the database without completely

readjusting the runs. The method assumes that the

data is primarily static.

The RLE scheme is modified by Stabno &

Wrembel (2007) using the well known Huffman

coding (Dipperstein, 2008). They developed a

compression technique called Run-length Huff-

man (RLH) based on RLE. In RLH, the encoded

symbols and their corresponding bit strings are

represented as the Huffman tree. The Huffman tree

is used for both compressing and decompressing.

The distances between bits of value”1” are encoded

in this scheme. This modified run length encoding

is next compressed using Huffman encoding.

Bitmap compression

A bitmap compression scheme consists of a bitmap

and a physical database which stores the non-

constant values of a linearzed array. The bitmap

is employed to indicate the presence or absence

of non-constant data.

Evolving Application Domains of Data Warehousing and Mining

Search WWH ::

Custom Search

Home