problem [7]. One of the main reasons is that global indexes (such as DHTs) must
be updated every time data is moved or changed, or a new machine is added. For
example, in a ring network such as the Cassandra architecture, new machines are
added to the ring near the bottlenecked machine, and the data is redistributed
between that machine and its new neighbor.
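The redistribution step can be illustrated with a minimal sketch of a token
ring. This is a toy model written for illustration, not Cassandra's actual
implementation: the `Ring` class, its integer tokens, and the `add_node` method
are all hypothetical names. It assumes each node owns the key positions between
its predecessor's token (exclusive) and its own token (inclusive).

```python
from bisect import bisect_left, insort


class Ring:
    """Toy token ring (hypothetical): a node owns positions in (predecessor, own token]."""

    def __init__(self, tokens):
        self.tokens = sorted(tokens)               # node positions on the ring
        self.data = {t: {} for t in self.tokens}   # token -> {position: value}

    def owner(self, position):
        # First token at or after the position; wrap to the first node past the end.
        i = bisect_left(self.tokens, position)
        return self.tokens[i % len(self.tokens)]

    def put(self, position, value):
        self.data[self.owner(position)][position] = value

    def add_node(self, new_token):
        """Add a node and move to it only the keys it takes over from its successor."""
        if new_token in self.data:
            raise ValueError("token already in use")
        successor = self.owner(new_token)          # the (bottlenecked) neighbor
        insort(self.tokens, new_token)
        self.data[new_token] = {}
        moved = [p for p in self.data[successor] if self.owner(p) == new_token]
        for p in moved:
            self.data[new_token][p] = self.data[successor].pop(p)
        return len(moved)


ring = Ring([100, 200, 300])
for position in (50, 110, 120, 150, 180, 190, 350):
    ring.put(position, "row")

# Node 200 holds five rows; a new node at 150 relieves it of the lower part.
moved = ring.add_node(150)
print(moved, sorted(ring.data[150]), sorted(ring.data[200]))
```

Only the new node and its successor exchange data; the other nodes are
untouched. Any structure that records where each key lives, however, has to be
updated for every moved key, which is the maintenance cost described above.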
One of the important properties of a cloud architecture is elastic scalability.
It must therefore support scale-out, where the responsibility for query
processing (and the corresponding data) is distributed among multiple nodes to
achieve higher throughput. Such a network needs good storage structures that
reduce the dependency of data distribution on other machines of the cluster [3].
Our proposed PK-map and Tuple-index-maps are fully decentralized: no predefined
limits are imposed on the size of the network or on the data distribution. We
store only information on the relationships between data items, not their
locations. The structures are therefore independent of data scale-out and of
remote data distribution, so even if data is relocated anywhere in the system,
we do not have to update them. In our previous work [17], we showed how the
proposed method decreases the interdependency of machines (containing related
data) while optimizing the join operation in query processing. In this paper,
we show how we optimize aggregate queries with join operations.
Aggregation is one of the expensive operations in distributed query processing
because sort and group-by operations are required to compute the aggregated
result. Most earlier research treats communication cost as cheaper than all
other operations of the query.¹ Hence a two-way optimization algorithm is most
commonly used: the query optimizer first generates the plan assuming that the
query is processed on the local machine, and then optimizes the plan for the
distributed architecture of query execution.
Due to the virtualized nature of a cloud environment, data storage and query
processing have become physically more distributed to meet resource
availability or the customer's service agreement [9]. This gives rise to an
increase in node-to-node communication. In such a scenario, even if the overall
response time of a distributed query decreases, the communication cost may
exceed the cost of the other query operators. It is therefore better to consider
both the query operator costs and the communication cost while generating the
plan.
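A small worked example may make the trade-off concrete. The sketch below uses
a simplified cost model in which communication cost is a per-message charge
(network plus CPU for packing and unpacking) plus a per-kilobyte transfer
charge. The `PlanCost` type, the two plans, and every number are hypothetical
and chosen only for illustration; they are not measurements from this paper.

```python
from dataclasses import dataclass


@dataclass
class PlanCost:
    """Hypothetical summary of one query plan; all costs in microseconds."""
    operator_us: int      # local join, sort, and group-by work
    messages: int         # node-to-node messages exchanged
    kilobytes_sent: int   # data shipped between nodes


# Assumed unit costs (illustrative values only).
PER_MESSAGE_US = 500       # network cost per message
CPU_PER_MESSAGE_US = 100   # CPU cost to pack, unpack, and process a message
PER_KILOBYTE_US = 100      # cost to transfer one kilobyte


def communication_us(plan):
    per_message = PER_MESSAGE_US + CPU_PER_MESSAGE_US
    return plan.messages * per_message + plan.kilobytes_sent * PER_KILOBYTE_US


def total_us(plan):
    return plan.operator_us + communication_us(plan)


# Plan A: cheap operators, but ships raw tuples to one site.
ship_raw = PlanCost(operator_us=40_000, messages=200, kilobytes_sent=2_000)
# Plan B: more local operator work, but ships only small partial results.
local_first = PlanCost(operator_us=60_000, messages=20, kilobytes_sent=50)

for name, plan in (("ship_raw", ship_raw), ("local_first", local_first)):
    print(name, communication_us(plan), total_us(plan))
```

Under these assumed numbers the plan with the cheaper operators is the more
expensive one overall, because its communication cost dominates. An optimizer
that ranks plans by operator cost alone would pick it anyway.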
In this paper, we use our map structures and integrity constraint inference to
push down or pull up group-by functions while generating an aggregate query
plan. During query processing, we eliminate most of the sort and group-by
operations by keeping the map structures sorted and enforcing a certain order
based on the hierarchy of the tables in the schema (the reference graph of
[17]). We will show how this lets us access only the more relevant data from
the database tables.
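The general idea of avoiding a sort when the input is already ordered, and of
pushing a group-by down to the nodes that hold the data, can be sketched as
follows. This is a generic illustration using Python's standard library, not
the PK-map or Tuple-index-map algorithm of this paper. The function names and
sample rows are hypothetical, and the sketch assumes a decomposable aggregate
(here `SUM`) and rows that arrive sorted by the grouping key.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter


def local_partial_sums(sorted_rows):
    """Group-by pushed down to one node.

    Rows are (key, value) pairs already sorted by key, so a single pass
    groups them without any sort step.
    """
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def global_sums(partials_per_node):
    """Combine per-node partial sums.

    Each input stream is sorted by key, so an order-preserving merge keeps
    equal keys adjacent and again no sort is needed.
    """
    merged = merge(*partials_per_node, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


# Hypothetical rows held by two nodes, each sorted by key.
node_a = [(1, 10), (1, 5), (2, 7)]
node_b = [(2, 3), (3, 4)]

result = list(global_sums([local_partial_sums(node_a),
                           local_partial_sums(node_b)]))
print(result)  # [(1, 15), (2, 10), (3, 4)]
```

Each node ships one partial result per key instead of every row, and the
ordering that is already present replaces the sort that a group-by would
otherwise require.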
The remainder of this paper is organized as follows. Section 2 reviews the
literature on aggregate query processing and optimization. Section 3 explains
our proposed framework. Section 4 presents the performance evaluation using
PlanetLab Cloud. Finally, Section 5 concludes the paper.
¹ Communication cost includes costs per message, costs to transfer data, and
CPU costs to pack, unpack, and process messages at the sending and receiving
sites.
 