Optimizing Aggregate Query Processing in Cloud Data Warehouses - Data Management in Cloud, Grid and P2P Systems

problem [7]. One of the main reasons is that global indexes (such as DHTs) must be updated every time data is moved or changed, or a new machine is added. For example, in a ring network such as the Cassandra architecture, new machines are added to the ring near the bottlenecked machine, and the data is redistributed between that machine and its new neighbor.
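
The ring example above can be sketched in a few lines. This is a toy model under stated assumptions, not Cassandra's actual implementation: the `Ring` class, its integer tokens, and the `split` method are hypothetical names introduced only for illustration. Each node owns the hash values up to and including its own token, and a new node takes over the lower half of the overloaded node's range.

```python
from bisect import bisect_left, insort


class Ring:
    """Toy ring (hypothetical): a node owns hashes in (predecessor, own token]."""

    def __init__(self, tokens):
        self.tokens = sorted(tokens)               # the shared, global view of the ring
        self.store = {t: {} for t in self.tokens}  # per-node local data

    def owner(self, h):
        # First token >= h; hashes above the largest token wrap to the first node.
        i = bisect_left(self.tokens, h)
        return self.tokens[i % len(self.tokens)]

    def put(self, h, value):
        self.store[self.owner(h)][h] = value

    def split(self, hot):
        """Add a node midway between `hot` and its predecessor and move data to it."""
        i = self.tokens.index(hot)
        if i == 0:
            raise ValueError("sketch does not handle the wrap-around range")
        new = (self.tokens[i - 1] + hot) // 2
        if new == self.tokens[i - 1]:
            raise ValueError("no room for a new token in this range")
        insort(self.tokens, new)                   # the global index changes here
        moved = {h: v for h, v in self.store[hot].items() if h <= new}
        self.store[new] = moved
        for h in moved:
            del self.store[hot][h]
        return new, len(moved)


ring = Ring([30, 60, 90])
for h in (5, 35, 40, 50, 58, 70, 95):
    ring.put(h, f"row-{h}")

new_token, moved = ring.split(60)  # node 60 is the assumed bottleneck
```

Only the overloaded node and its new neighbor exchange data, but the shared `tokens` list still has to change, which is the kind of global-index update the text describes.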

One of the important properties of a cloud architecture is elastic scalability. It must therefore support scale-out, where the responsibility for query processing (and the corresponding data) is distributed among multiple nodes to achieve higher throughput. Such a network needs good storage structures that reduce the dependency of data distribution on other machines of the cluster [3].

Our proposed PK-map and Tuple-index-maps are fully decentralized: no predefined limits are imposed on the size of the network or on data distribution. We store only information about the relationships among the data, not their locations. The structures are therefore independent of the scale-out of data and of remote data distribution, so even if data is relocated anywhere in the system, we do not have to update them. In our previous work [17], we showed how the proposed method decreases the interdependency of machines (containing related data) while optimizing the join operation in query processing. In this paper, we show how we optimize aggregate queries with join operations.
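
The idea of storing relationships rather than locations can be illustrated with a small sketch. The layout of the PK-map and Tuple-index-maps is not given in this section, so everything below is an assumption: `RelationshipMap`, the logical tuple ids such as `"orders:3"`, and the separate `placement` dictionary are hypothetical stand-ins, not the authors' structures.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class RelationshipMap:
    """Hypothetical stand-in: primary key -> logical ids of referencing tuples."""
    refs: dict = field(default_factory=dict)

    def link(self, pk, child_tuple_id):
        self.refs.setdefault(pk, []).append(child_tuple_id)

    def children(self, pk):
        return self.refs.get(pk, [])


# Assumed to be kept by the storage layer, separately from the map:
# logical tuple id -> node that currently holds the tuple.
placement = {"orders:1": "node-A", "orders:2": "node-A", "orders:3": "node-B"}

customer_orders = RelationshipMap()
customer_orders.link(10, "orders:1")
customer_orders.link(10, "orders:3")
customer_orders.link(11, "orders:2")

snapshot = deepcopy(customer_orders.refs)

# Relocating a tuple changes its placement only; the map is not touched.
placement["orders:3"] = "node-C"
unchanged = customer_orders.refs == snapshot
```

Because the map refers to tuples by logical identity, moving `orders:3` to another node leaves it valid, which mirrors the claim that the structures need no update when data is relocated.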

Aggregation is one of the expensive operations in distributed query processing because it requires sort and group-by operations to compute the aggregated result. Most earlier research treats communication cost as cheaper than all other operations of the query.¹ Hence, a two-way optimization algorithm is most commonly used, in which the query optimizer first generates the plan assuming that the query is processed on a local machine and then optimizes the plan considering the distributed architecture of query execution.

Owing to the virtualized nature of a cloud environment, data storage and query processing have become physically more distributed in order to meet resource availability or the customer's service agreement [9]. This increases node-to-node communication. In such a scenario, even if the overall response time of a distributed query decreases, the communication cost may exceed the other query operator costs. It is therefore better to consider both the query operator costs and the communication cost while generating the plan.
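
A minimal cost sketch shows why the choice of plan can change once communication is counted. It uses the three components of communication cost named in the footnote (cost per message, cost to transfer data, and CPU cost to handle messages). The class names, the two plans, and all numbers are hypothetical and are in arbitrary integer cost units; they are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    per_message: int      # fixed cost of sending one message
    per_kb: int           # cost of transferring one kilobyte
    cpu_per_message: int  # pack/unpack/process cost at the sending and receiving sites


@dataclass(frozen=True)
class Plan:
    name: str
    operator_cost: int    # sort, group-by, join, scan: everything done locally
    messages: int
    kb_shipped: int


def communication_cost(plan, p):
    return plan.messages * (p.per_message + p.cpu_per_message) + plan.kb_shipped * p.per_kb


def total_cost(plan, p):
    return plan.operator_cost + communication_cost(plan, p)


params = CostParams(per_message=20, per_kb=1, cpu_per_message=5)  # assumed values
plans = [
    # Ships raw rows to one site and aggregates there.
    Plan("ship-then-aggregate", operator_cost=1000, messages=400, kb_shipped=4000),
    # Aggregates at each site first and ships only partial results.
    Plan("aggregate-then-ship", operator_cost=1400, messages=10, kb_shipped=20),
]

best_local_only = min(plans, key=lambda pl: pl.operator_cost).name
best_with_comm = min(plans, key=lambda pl: total_cost(pl, params)).name
```

With these assumed numbers, ranking by operator cost alone picks `ship-then-aggregate`, while ranking by total cost picks `aggregate-then-ship`, because shipping raw rows dominates the first plan.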

In this paper, we use our map structures and integrity constraint inference to push down or pull up group-by functions while generating an aggregate query plan. During query processing, we eliminate most of the sort and group-by operations by keeping the map structures sorted and enforcing a certain order based on the hierarchy of the tables in the schema (the reference graph of [17]). We will show how this lets us access only the more relevant data from the database tables.
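
The effect of pushing a group-by below a join, and of relying on sorted input instead of an explicit sort, can be sketched as follows. This is a generic illustration of group-by push-down over a primary key/foreign key join, not the authors' algorithm; the `orders` and `lineitems` tables and both function names are hypothetical. The rewrite is valid here because `order_id` is assumed to be a primary key of `orders`, so every group of line items joins with exactly one order.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Assumed schema: orders(order_id PK, customer), lineitems(order_id FK, amount).
orders = {1: "alice", 2: "bob", 3: "alice"}
# Assumed to arrive already sorted by order_id, so no sort step is needed.
lineitems = [(1, 10), (1, 5), (2, 7), (3, 1), (3, 2)]


def join_then_group(orders, lineitems):
    """Baseline: join every line item, then aggregate per customer."""
    totals = defaultdict(int)
    lookups = 0
    for order_id, amount in lineitems:
        lookups += 1                       # one join probe per line item
        totals[orders[order_id]] += amount
    return dict(totals), lookups


def group_then_join(orders, lineitems_sorted):
    """Pushed-down group-by: pre-aggregate per order_id, then join."""
    totals = defaultdict(int)
    lookups = 0
    # groupby only merges adjacent equal keys, which is enough on sorted input.
    for order_id, rows in groupby(lineitems_sorted, key=itemgetter(0)):
        partial = sum(amount for _, amount in rows)
        lookups += 1                       # one join probe per order
        totals[orders[order_id]] += partial
    return dict(totals), lookups


baseline, baseline_lookups = join_then_group(orders, lineitems)
pushed, pushed_lookups = group_then_join(orders, lineitems)
```

Both plans produce the same totals per customer, but the pushed-down version probes the join once per order rather than once per line item and never sorts or hashes the raw line items.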

The remainder of this paper is organized as follows: Section 2 provides a literature review on aggregate query processing and optimization. Section 3 explains our proposed framework. Section 4 presents the performance evaluation using PlanetLab Cloud. Finally, Section 5 states the conclusion.

¹ Communication cost includes the cost per message, the cost to transfer data, and the CPU cost to pack, unpack, and process messages at the sending and receiving sites.
