Database Reference
In-Depth Information
(Surrogate_key, Suplier_id,
Quantity, Total, Date)
(Product_id, Quantity, Date)
Data
Source
Data
Warehouse
Join
operator
Source table
Fact table
Product_id Quantity
Date
Surrogate_key Suplier_id Quantity
Total
Date
11
5
13-07-2010
1001
P101
5
$10.00 13-07-2010
22
3
13-07-2010
1002
O111
3
$9.00 13-07-2010
33
10
13-07-2010
1003
I222
10
$50.00 13-07-2010
(Surrogate_key,
Sale_price, Suplier_id)
Master
Data
Look-up table
Product_id Surrogate_key Product_name Sale_price Suplier_id
11
1001
Pepsi
$2.00
P101
22
1002
Orange Juice
$3.00
O111
33
1003
Ice cream
$5.00
I222
Fig. 1. An example of stream-based join
a staggered execution of the hash table build in order to load in stream tuples
more steadily. However there are some issues such as suboptimal distribution of
memory among the join components and an inecient strategy for accessing the
disk-based relation [12].
Although the MESHJOIN algorithm eciently amortizes the disk I/O cost
over fast input streams, the question remains how much potential for improve-
ment remains untapped due to the algorithm not being able to adapt to common
characteristics of real-world applications. In this paper we focus on one of the
most common characteristics, a skewed distribution. Such distributions arise in
practice, for example current economic models show that in many markets a se-
lect few products are bought with higher frequency [11]. Therefore, in the input
stream, the sales transactions related to those products are the most frequent. In
MESHJOIN, the algorithm does not consider the frequency of stream tuples, and
does not need an index structure on the master data. This can be useful in some
circumstances, but still in many other cases one obviously wants to use an index
to gain maximum performance. We propose an adaptive algorithm called Ex-
tended Hybrid Join (X-HYBRIDJOIN). The key feature of X-HYBRIDJOIN is
that the algorithm stores the most used portion of the disk-based relation, which
matches the frequent items in the stream, in memory. As a result, this reduces
the I/O cost substantially, which improves the performance of the algorithm.
X-HYBRIDJOIN has two major modifications compared with MESHJOIN.
Firstly, the hash join component of X-HYBRIDJOIN is modified so that it can
make use of an index. Secondly, X-HYBRIDJOIN caches frequently used master
data. Since we want to compare MESHJOIN and X-HYBRIDJOIN, it is impor-
tant to clarify, which change leads to the performance improvement. Therefore
we also present an intermediate step, HYBRIDJOIN, which implements only
the first modification, and we compare all three algorithms. Since our purpose
is primarily to gauge the potential of skewed distributions, we consider a very
clean, artificial dataset that exactly exhibits a well-understood type of skew, a
power law.
 
Search WWH ::




Custom Search