X-HYBRIDJOIN for Near-Real-Time Data Warehousing - Advances in Databases

Database Reference

In-Depth Information

The remainder of the paper is structured as follows. A review of the related

work is presented in Section 2. In section 3 we describe the intermediate algo-

rithm HYBRIDJOIN that already uses an index in the join process. In Section 4,

we present the difference between HYBRIDJOIN and X-HYBRIDJOIN, and also

derive the cost model for X-HYBRIDJOIN. The experimental study is discussed

in Section 5 and finally Section 6 concludes the paper.

2 Related Work

In this section, we present an overview of the previous work that has been done

in this area, focusing on that which is closely related to our problem domain.

A novel algorithm Mesh Join (MESHJOIN) [8] [9] has been designed espe-

cially for joining a continuous stream with a disk-based relation, such as the

scenario in active data warehouses. Although it is an adaptive approach, there

are some research issues such as suboptimal distribution of memory among the

join components and an inecient strategy for accessing the disk-based relation.

The algorithm makes no assumptions about data distribution or the organiza-

tion of the master data. However, the problem that we are addressing is that

MESHJOIN's performance is directly coupled to the size of the master data ta-

ble, and its performance is inversely proportional to the size of the master data

table. This is an undesired behavior if the master data becomes very large, and

our analysis will show that it is indeed an unnecessary behavior. The problem

becomes obvious if we consider that the master data table contains a large part

that is never joined with the stream data.

A revised version of MESHJOIN called R-MESHJOIN (reduced Mesh Join)

[12] has been presented by us. It addresses the issue of optimal distribution

of memory among the join components. In this algorithm a new strategy for

memory distribution among the join components is introduced capturing real

constraints. However the issue of an inecient strategy for accessing the disk-

based relation still exists in R-MESHJOIN.

One approach for improving MESHJOIN has been a partition-based join al-

gorithm [10] which can also deal with stream intermittence. It uses a two-level

hash table in order to attempt to join stream tuples as soon as they arrive, and

uses a partition-based waiting area for other stream tuples. For the algorithm in

[10], however, the time that a tuple waits for execution is not bounded. We are,

however, interested in a join approach where the time in which a stream tuple

is joined is guaranteed.

3

Index-Based Hash Join Architecture: HYBRIDJOIN

In this section we introduce the HYBRIDJOIN algorithm, which implements our

first modification of MESHJOIN in order to make use of a non-clustered index.

We introduce the join architecture for HYBRIDJOIN. This will be used, with a

single modification, for the algorithm X-HYBRIDJOIN as well.

Search WWH ::

Custom Search

Home