Database Reference
In-Depth Information
run length encoding
PV=(y 1 ,y 2 ,...,y n ) where y i are in DB and
y i ≠c.
The y i are arranged according to their logical
order in DB. No compression algorithm is ap-
plied on PV because it stores only non-constants
values. The Bit Vector BV indicates the locations
of constants and non-constants in the database.
The bit vector is
BV=(b 1 ,b 2 ,…,bn) where b i =1 if xi i ≠c and b i =0
if x i =c for 1≤ i ≤ N.
where BV consists of N bits.
The Address Vector AV is typically small and
is used as an index for searching the database. It
is stored in main memory rather than secondary
storage. In addition to efficient compression fast
forward and backward mapping between logical
and physical databases is also important. To do
this, BV is divided into subvectors of D bits each.
The subvectors are compressed independently.
This division of BV into subvectors makes the
Address Vector AV sufficiently small to store it
in main memory.
BV can be compressed by run-length encod-
ing method (also discussed in this chapter). The
division of BV into subvectors imposes a division
of the database DB into d= N/D sections, each
consisting of D elements. The address vector is
defined as:
AV=(a 1 ,a 2 ,a 3 ,…a d )
Where a 1 =0 and for i ³ , a i is the relative
position in PV of the last non-constant element
in the if th section of DB if such a non-constant
exists, otherwise we set a i =a i-1 .
Consider an example where DB=(1, 0, 0, 7, 0,
0, 0, 0, 0, 0, 0, 0, 18, 0, 0, 0, 13, 0, 0, 0, 0, 9, 0, 0,
37) and constants are 0 and D=5, Then
BV=(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 0, 0, 0, 1, 0, 0, 1)
PV=(1, 7, 18, 13, 9, 37)
AV=(0, 2, 2, 3, 4)
Run Length Encoding (RLE) is based on the idea
to replace a long sequence of the same symbol by
a shorter sequence in a linearized array. RLE is a
very simple form of data compression in which runs
of data (that is, sequences in which the same data
value occurs in many consecutive data elements)
are stored as a single data value and count, rather
than as the original run. This is most useful on
data that contains many such runs.The sequence
of length l of a repeated symbol ' s ' is replaced by
a shorter sequence, usually containing one or more
symbols of ' s ', a length information and sometimes
an escape symbol. RLE replaces a string of repeated
symbols with a single symbol and a count ( run
length ) indicating the number of times the symbol is
repeated. For example the string “aaaabbcdeeeeef-
ghhhij” is replaced with “a4b2c1d1e5f1g1h3i1j1”.
The numbers indicate that they are values, not
symbols. One of the major disadvantages of the
classical run-length encoding is that they cannot
support updates to the database without completely
readjusting the runs. The method assumes that the
data is primarily static.
The RLE scheme is modified by Stabno &
Wrembel (2007) using the well known Huffman
coding (Dipperstein, 2008). They developed a
compression technique called Run-length Huff-
man (RLH) based on RLE. In RLH, the encoded
symbols and their corresponding bit strings are
represented as the Huffman tree. The Huffman tree
is used for both compressing and decompressing.
The distances between bits of value”1” are encoded
in this scheme. This modified run length encoding
is next compressed using Huffman encoding.
Bitmap compression
A bitmap compression scheme consists of a bitmap
and a physical database which stores the non-
constant values of a linearzed array. The bitmap
is employed to indicate the presence or absence
of non-constant data.
Search WWH ::




Custom Search