X-HYBRIDJOIN for Near-Real-Time Data Warehousing - Advances in Databases

X-HYBRIDJOIN for Near-Real-Time Data Warehousing

Muhammad Asif Naeem, Gillian Dobbie, and Gerald Weber

Department of Computer Science, The University of Auckland, Private Bag 92019, Auckland, New Zealand

mnae006@aucklanduni.ac.nz, {gill,gerald}@cs.auckland.ac.nz

Abstract. In order to make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up to date with respect to end-user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, where the stream of updates is joined with disk-based master data. The stream-based algorithm Mesh Join (MESHJOIN) has been proposed to amortize disk access over a fast stream. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions can be found, e.g., certain products are sold more frequently than the remainder of the products. The question arises of how much performance MESHJOIN loses by not adapting to data skew. In this paper we perform a rigorous experimental study analyzing the possible performance improvements while considering typical data distributions. For this purpose we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, reducing the disk access overhead significantly. We compare the performance of X-HYBRIDJOIN against the performance of MESHJOIN. We take several precautions to make sure the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be used by non-adaptive approaches such as MESHJOIN.

Keywords: Near-real-time data warehousing, stream-based join, data transformation, performance and tuning.
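
The intuition in the abstract, that keeping frequently joined master data in memory pays off when the stream is skewed, can be illustrated with a small toy simulation. This is a sketch under stated assumptions and implements neither MESHJOIN nor X-HYBRIDJOIN: the function names, the 1/(rank+1) weighting, the key and stream sizes, and the choice of pinning 50 keys are all hypothetical, and each unpinned stream tuple is simply counted as one disk probe.

```python
import random
from collections import Counter


def skewed_stream(num_keys, length, seed=0):
    """Draw `length` join keys where the key of rank r has weight 1/(r+1).

    The weighting is an assumed stand-in for a skewed sales distribution.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(num_keys)]
    return rng.choices(range(num_keys), weights=weights, k=length)


def disk_probes(stream_keys, pinned_keys):
    """Count stream tuples whose master record is not pinned in memory."""
    return sum(1 for key in stream_keys if key not in pinned_keys)


stream = skewed_stream(num_keys=1000, length=10_000)

# Pin the 50 most frequent keys observed in the stream (illustrative choice).
pinned = {key for key, _ in Counter(stream).most_common(50)}

without_pinning = disk_probes(stream, set())   # every tuple goes to disk
with_pinning = disk_probes(stream, pinned)     # hot keys are served from memory
print(without_pinning, with_pinning)
```

With a uniform stream, pinning the same number of keys would cover only a small fraction of the tuples, which is why the benefit depends on skew.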

1 Introduction

Near-real-time data warehouse deployments are driving an evolution to more aggressive data freshness levels. The tools and techniques for delivering these new service levels are evolving rapidly [1] [2]. In the beginning, most data warehouses refreshed all content fully during each load cycle. However, due to an increasing demand for information freshness, full refreshes could no longer meet business needs. Therefore the data acquisition mechanism in warehouses was changed from full
