X-HYBRIDJOIN for Near-Real-Time Data Warehousing - Advances in Databases

Database Reference

In-Depth Information

HYBRIDJOIN joins a disk-based relation R with a stream S . We assume a

non-clustered index on R for the join attribute, and we assume that the join

attribute is unique within the master data. This is a very natural set of as-

sumptions and matches with common application domains, for example in key

exchange applications. By only requiring a non-clustered index, we keep our

assumptions as minimal as possible.

The memory architecture used in HYBRIDJOIN and in X-HYBRIDJOIN is

shown in Figure 2. The main memory components are a disk buffer, a hash table,

a queue and a stream buffer while the disk-based relation R and stream S are the

inputs. In our algorithm, we assume that R has an index with the join attribute

as the key.

The stream is used as the build input. This means that the algorithm keeps

stream tuples in a hash table which occupies the largest share of the memory, and

the hash table is filled with the next pending stream tuples to its full capacity.

Additionally we keep identifiers of the stream tuples in a queue which allows

random deletion, the simplest implementation is a doubly linked list.

HYBRIDJOIN is an iterative algorithm, and in each iteration it uses a parti-

tion of the disk-based relation R as a probe input. For that purpose, the partition

is loaded into the disk buffer. In HYBRIDJOIN, the disk buffer contains only

this partition, later in X-HYBRIDJOIN the partition will only occupy one part

of the disk buffer. After that, the algorithm performs the typical operation of a

hash join, i.e., it loops over all the tuples of the disk buffer and looks them up

in the hash table. In the case of a match, the algorithm generates that stream

tuple as an output.

Also, in each iteration, HYBRIDJOIN evicts stream tuples that have been

matched. This is justified through the assumption that the join attribute is

unique in R . Evicting a tuple means it is deleted from the hash table and the

queue. The algorithm also keeps a counter w of the evicted tuples. After pro-

cessing the whole disk buffer, the algorithm reads w new tuples from the stream

buffer, loads them in the hash table along with entering their identifiers in the

queue.

For choosing the next partition of R , HYBRIDJOIN looks at the join attribute

of the oldest stream tuple in the queue. Using the index, it loads the partition

of R with that join attribute value into the disk buffer. It is this last step which

makes HYBRIDJOIN adaptive, because in HYBRIDJOIN, every loaded parti-

tion matches at least one stream tuple. As a simple example, consider R has a

section that is not referred to in the stream, for example an obsolete group of

products. In MESHJOIN, this section would still be loaded, while in HYBRID-

JOIN it would not be loaded, because no stream tuple will trigger the loading

of that section.

HYBRIDJOIN works for any data distribution, as MESHJOIN does. How-

ever, in practice, certain distributions are common. Current research has shown

that sales data typically follows a power law, or Zipfian distribution [11]. The

power law is characterized by its exponent. For an exponent < 1 the distribu-

tion is said to have a long tail, for an exponent > 1 the distribution has a short

Advances in Databases

Search WWH ::

Custom Search

Home