Database Reference
In-Depth Information
HYBRIDJOIN joins a disk-based relation R with a stream S . We assume a
non-clustered index on R for the join attribute, and we assume that the join
attribute is unique within the master data. This is a very natural set of as-
sumptions and matches with common application domains, for example in key
exchange applications. By only requiring a non-clustered index, we keep our
assumptions as minimal as possible.
The memory architecture used in HYBRIDJOIN and in X-HYBRIDJOIN is
shown in Figure 2. The main memory components are a disk buffer, a hash table,
a queue and a stream buffer while the disk-based relation R and stream S are the
inputs. In our algorithm, we assume that R has an index with the join attribute
as the key.
The stream is used as the build input. This means that the algorithm keeps
stream tuples in a hash table which occupies the largest share of the memory, and
the hash table is filled with the next pending stream tuples to its full capacity.
Additionally we keep identifiers of the stream tuples in a queue which allows
random deletion, the simplest implementation is a doubly linked list.
HYBRIDJOIN is an iterative algorithm, and in each iteration it uses a parti-
tion of the disk-based relation R as a probe input. For that purpose, the partition
is loaded into the disk buffer. In HYBRIDJOIN, the disk buffer contains only
this partition, later in X-HYBRIDJOIN the partition will only occupy one part
of the disk buffer. After that, the algorithm performs the typical operation of a
hash join, i.e., it loops over all the tuples of the disk buffer and looks them up
in the hash table. In the case of a match, the algorithm generates that stream
tuple as an output.
Also, in each iteration, HYBRIDJOIN evicts stream tuples that have been
matched. This is justified through the assumption that the join attribute is
unique in R . Evicting a tuple means it is deleted from the hash table and the
queue. The algorithm also keeps a counter w of the evicted tuples. After pro-
cessing the whole disk buffer, the algorithm reads w new tuples from the stream
buffer, loads them in the hash table along with entering their identifiers in the
queue.
For choosing the next partition of R , HYBRIDJOIN looks at the join attribute
of the oldest stream tuple in the queue. Using the index, it loads the partition
of R with that join attribute value into the disk buffer. It is this last step which
makes HYBRIDJOIN adaptive, because in HYBRIDJOIN, every loaded parti-
tion matches at least one stream tuple. As a simple example, consider R has a
section that is not referred to in the stream, for example an obsolete group of
products. In MESHJOIN, this section would still be loaded, while in HYBRID-
JOIN it would not be loaded, because no stream tuple will trigger the loading
of that section.
HYBRIDJOIN works for any data distribution, as MESHJOIN does. How-
ever, in practice, certain distributions are common. Current research has shown
that sales data typically follows a power law, or Zipfian distribution [11]. The
power law is characterized by its exponent. For an exponent < 1 the distribu-
tion is said to have a long tail, for an exponent > 1 the distribution has a short
 
Search WWH ::




Custom Search