Optimizing Aggregate Query Processing in Cloud Data Warehouses - Data Management in Cloud, Grid and P2P Systems

2 Related Work

Aggregate query processing has been studied extensively [5]. To our knowledge,
however, few of these works consider communication cost when optimizing
aggregate queries. We analyzed several works that optimize aggregate query
operations. Building on them, we propose storage structures that reduce not
only the cost of query operations but also the communication overhead incurred
in cloud data warehouses.

Earlier papers that optimize aggregate query processing include [2], [14] and
[22]. These papers push the group-by operation down the query tree to improve
query response time. W. Yan [22] proposed two kinds of transformations, namely
eager aggregation and lazy aggregation. In eager aggregation, the group-by
operation is pushed down in the query tree, while in lazy aggregation it is
pushed up. We use the transformations of [22] in our system, along with our
PK-map and Tuple-index-map, to generate an optimized query plan for aggregate
queries.
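
The difference between the two transformations can be sketched on a toy
example. The following is a minimal illustration only, not the plan generator
of this paper or of [22]: the tables, column names and function names are
hypothetical, and the rewrite is valid here because cust_id is assumed to be a
key of the customers table. Both plans compute the total amount per region;
the eager plan aggregates orders per customer before the join, so fewer rows
reach the join.

```python
from collections import defaultdict

# Hypothetical toy tables (assumptions for illustration, not from the paper).
customers = {1: "EU", 2: "EU", 3: "US"}  # cust_id -> region (cust_id is a key)
orders = [(1, 10.0), (1, 5.0), (2, 7.0), (3, 2.0), (3, 8.0)]  # (cust_id, amount)


def lazy_aggregation(orders, customers):
    """Join first, then group by region (group-by stays above the join)."""
    joined = [(customers[cid], amount) for cid, amount in orders if cid in customers]
    totals = defaultdict(float)
    for region, amount in joined:
        totals[region] += amount
    return dict(totals), len(joined)


def eager_aggregation(orders, customers):
    """Pre-aggregate orders per cust_id, then join, then group by region."""
    partial = defaultdict(float)
    for cid, amount in orders:
        partial[cid] += amount
    joined = [(customers[cid], subtotal)
              for cid, subtotal in partial.items() if cid in customers]
    totals = defaultdict(float)
    for region, subtotal in joined:
        totals[region] += subtotal
    return dict(totals), len(joined)


lazy_totals, lazy_rows = lazy_aggregation(orders, customers)
eager_totals, eager_rows = eager_aggregation(orders, customers)
```

In this sketch both plans return the same totals, while the eager plan joins
three pre-aggregated rows instead of five order rows. Whether that saving
outweighs the extra group-by depends on the data, which is why an optimizer
has to choose between the two forms.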

Order-Optimization [4] presents techniques that reduce the number of sorts
needed for query processing by finding the cover set using keys, predicates and
indexes. Since our proposed map structures are already sorted on keys, we
eliminate most of the sort operations required for joins on the tables.
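
To illustrate why key-sorted inputs remove the need for a sort step, the
following is a minimal merge-join sketch. It is an assumption-laden example,
not the PK-map or Tuple-index-map implementation: the function name and sample
data are hypothetical, both inputs are assumed to be lists of (key, value)
pairs already sorted by key, and keys on the left side are assumed to be
unique.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key.

    Assumes keys in `left` are unique (for example a primary key);
    `right` may repeat keys. No sort is performed on either input.
    """
    out = []
    i = 0
    for key, rval in right:
        # Advance the left cursor until it reaches or passes the right key.
        while i < len(left) and left[i][0] < key:
            i += 1
        if i < len(left) and left[i][0] == key:
            out.append((key, left[i][1], rval))
    return out


# Hypothetical sample data, both sorted on the join key.
dept = [(1, "eng"), (2, "ops"), (4, "hr")]
emp = [(1, "ann"), (1, "bob"), (3, "cy"), (4, "dee")]
joined = merge_join(dept, emp)
```

Each input is scanned once from front to back, so the join cost is linear in
the input sizes once the sort order is already available.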

Coloring-Away [23] proposes query plan generation using a tree-coloring
mechanism. It considers both communication cost and data re-partitioning, and
uses tree coloring to generate an optimal query plan. In our framework, we
optimize the query operations that cause these performance problems, such as
aggregates and joins, by doing sort and group-by on the fly.

Avoid-Sort-Groupby [24] proposes a query plan refinement algorithm that
eliminates unnecessary sorting and grouping from the query plan. It uses
inference strategies and the order properties of the relation to find the
unnecessary sorting or grouping. T. Neumann [19] points out that it is
necessary to consider both ordering and grouping when generating the query
plan.
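
As a small illustration of how an order property can make a sort redundant,
the sketch below groups rows that already arrive ordered on the grouping
attribute. It is not the refinement algorithm of [24]; the function name and
sample rows are hypothetical. Because equal keys are adjacent, a single
streaming pass suffices and no sort or hash table is needed.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of (region, amount), already ordered by region.
rows = [("EU", 10), ("EU", 7), ("US", 2), ("US", 8)]


def grouped_sums(sorted_rows):
    """Sum the second field per key, relying on the input order on the key."""
    return [(key, sum(value for _, value in group))
            for key, group in groupby(sorted_rows, key=itemgetter(0))]


result = grouped_sums(rows)
```

Note that itertools.groupby only merges adjacent equal keys, so this shortcut
is correct only when the input order on the grouping attribute is guaranteed.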

Cooperative-Sort [25] presents an evaluation technique for sorting tables,
aimed at queries that need multiple sort orders of the same table on different
attributes. It minimizes the I/O operations of successive sort operations,
which reduces the overall query cost.

Many researchers have proposed pre-computing aggregates [6] [16], which is
useful for decision support systems. Decision support systems store huge
amounts of historical data for analysis and decision-making. These databases
are updated infrequently (once an hour or once a day) in batches, which made it
easy to compute the aggregation ahead of time and store it as data cubes or
materialized views. Recently, the interval between historic and current data
has shrunk considerably, so re-computing data cubes or materialized views every
time the data is updated becomes complicated and time consuming. Recent
research by companies like HP, Oracle and Teradata [8] [18] [21] shows new
parallelization schemes for processing joins and aggregate operations that
eliminate data cubes. In this paper, we therefore concentrate on optimizing
aggregate queries without pre-computation.
