X-HYBRIDJOIN for Near-Real-Time Data Warehousing - Advances in Databases

Database Reference

In-Depth Information

(line 1). In the beginning all slots in the hash table H are empty; therefore,

h S is assigned to w (line 2). In the abstract level description, the algorithm

contains two kind of loops. One is called the outer loop, which is an endless

loop (line 3). The key objective of the outer loop is to build the stream in the

hash table. Within the outer loop, the algorithm runs two independent inner

loops. One loop implements the probing module for the non-swappable part of

the disk buffer, while the other inner loop implements the probing of the swap-

pable part of the disk buffer. As the outer loop begins, the algorithm observes

the status of the stream buffer. If stream is available the algorithm reads the

w tuples from the stream buffer and loads them into the hash table, while also

enqueuing their attribute values into the queue. After completing the stream

input the algorithm resets w to 0 (lines 4 to 7). The algorithm then executes the

first inner loop, in which it reads all tuples one-by-one from the non-swappable

part of the disk buffer and looks them up into the hash table. In the case of

a match, the algorithm generates the join output. Due to the multi-hash-map,

there can be more than one match against one disk tuple. After generating the

join output the algorithm deletes all matched tuples from the hash table, along

with the corresponding nodes from the queue. The algorithm also increments w

with the number of vacated slots in the hash table (lines 8 to 14). Before starting

the second inner loop, the algorithm reads the oldest value of the join attribute

from the queue (line 3) and loads a disk partition p i (where 2

n )intothe

swappable part of the disk buffer, using that join attribute value as an index

(lines 15 and 16). As the specified disk partition is loaded into the swappable

part of the disk buffer, the algorithm starts the second inner loop and repeats

all the steps described in the first inner loop (lines 17 to 23).

i

≤

Note: If we switch-off the first inner loop (lines 8 to 14), the algorithm works

as HYBRIDJOIN.

4.3

Cost Model

In this section we derive the general formulae for calculating the cost for our

proposed X-HYBRIDJOIN. We derive equations for memory and processing time

of X-HYBRIDJOIN. Equation 1 describes the total memory used to implement

the algorithm except for the stream buffer; whereas Equation 2 calculates the

processing cost for w tuples. The symbols used to measure the costs are specified

in Table 1.

Memory cost. In X-HYBRIDJOIN, the disk buffer is divided into two equal

parts. One is swappable, the other is non-swappable. As said before, the largest

share of the total memory is used for the hash table; a much smaller portion is

used for the disk buffer. The queue size is a constant fraction of the hash table

size. The memory for each component of X-HYBRIDJOIN can be calculated as

shown below.

Memory reserved for the swappable and non-swappable parts= v P + v P =2 v P

(in the case of HYBRIDJOIN it is v P only).

Memory for the hash table = α ( M −

2 v P ).

Advances in Databases

Search WWH ::

Custom Search

Home