X-HYBRIDJOIN for Near-Real-Time Data Warehousing - Advances in Databases - page 35

Database Reference

In-Depth Information

(Surrogate_key, Suplier_id,

Quantity, Total, Date)

(Product_id, Quantity, Date)

Data

Source

Data

Warehouse

Join

operator

Source table

Fact table

Product_id Quantity

Date

Surrogate_key Suplier_id Quantity

Total

Date

11

5

13-07-2010

1001

P101

5

$10.00 13-07-2010

22

3

13-07-2010

1002

O111

3

$9.00 13-07-2010

33

10

13-07-2010

1003

I222

10

$50.00 13-07-2010

(Surrogate_key,

Sale_price, Suplier_id)

Master

Data

Look-up table

Product_id Surrogate_key Product_name Sale_price Suplier_id

11

1001

Pepsi

$2.00

P101

22

1002

Orange Juice

$3.00

O111

33

1003

Ice cream

$5.00

I222

Fig. 1. An example of stream-based join

a staggered execution of the hash table build in order to load in stream tuples

more steadily. However there are some issues such as suboptimal distribution of

memory among the join components and an inecient strategy for accessing the

disk-based relation [12].

Although the MESHJOIN algorithm eciently amortizes the disk I/O cost

over fast input streams, the question remains how much potential for improve-

ment remains untapped due to the algorithm not being able to adapt to common

characteristics of real-world applications. In this paper we focus on one of the

most common characteristics, a skewed distribution. Such distributions arise in

practice, for example current economic models show that in many markets a se-

lect few products are bought with higher frequency [11]. Therefore, in the input

stream, the sales transactions related to those products are the most frequent. In

MESHJOIN, the algorithm does not consider the frequency of stream tuples, and

does not need an index structure on the master data. This can be useful in some

circumstances, but still in many other cases one obviously wants to use an index

to gain maximum performance. We propose an adaptive algorithm called Ex-

tended Hybrid Join (X-HYBRIDJOIN). The key feature of X-HYBRIDJOIN is

that the algorithm stores the most used portion of the disk-based relation, which

matches the frequent items in the stream, in memory. As a result, this reduces

the I/O cost substantially, which improves the performance of the algorithm.

X-HYBRIDJOIN has two major modifications compared with MESHJOIN.

Firstly, the hash join component of X-HYBRIDJOIN is modified so that it can

make use of an index. Secondly, X-HYBRIDJOIN caches frequently used master

data. Since we want to compare MESHJOIN and X-HYBRIDJOIN, it is impor-

tant to clarify, which change leads to the performance improvement. Therefore

we also present an intermediate step, HYBRIDJOIN, which implements only

the first modification, and we compare all three algorithms. Since our purpose

is primarily to gauge the potential of skewed distributions, we consider a very

clean, artificial dataset that exactly exhibits a well-understood type of skew, a

power law.

Next Page

Advances in Databases

Search WWH ::

Custom Search

Home