6.7.2 Scalability with Increasing Joins
6.7.3 Scalability with Increasing Cluster Size
6.8 Intraquery Scan Sharing for NTGA Execution Plans
6.8.1 Scan-Sharing Strategies for Efficient Processing of Repeated Properties
6.8.1.1 Classification of Triplegroups
6.8.1.2 Implementation of Clone Operation
6.8.2 Case Study: Impact of Scan-Sharing for Graph Pattern Queries with Repeated Properties
6.8.2.1 Setup and Testbed
6.8.2.2 Varying Number of Repeated Properties across a Query
6.8.2.3 Varying Size of RDF Graphs
6.9 Nesting-Aware Physical Operators to Minimize Data Transfer Costs in NTGA Execution Plans
6.9.1 Unnesting Strategies for Efficient Management of Multivalued Properties
6.9.2 Case Study: Impact of Nesting and Lazy Unnesting Strategies for Graph Pattern Queries with Multivalued Properties
6.10 Concluding Remarks
References
6.1 INTRODUCTION
The growing success of the Semantic Web and Web of data initiatives has ushered in
the era of “Big Semantic Web Data.” Data sets such as the Billion Triple Challenge
[2] are on the order of billions of triples, and scientific data collections like the Open
Science Data Cloud [3] are approaching petabyte scale. A crucial question now is
how to meet the scalability challenges of processing such data collections. Further,
emerging applications are introducing nontraditional scalability requirements, where
needs are elastic and vary significantly over time. For example,
a biologist may want to analyze their protein data by linking it to other publicly
available related data. This data may come from their own domain or from other
domains, for example, data about chemical compounds; interdisciplinary research
increasingly demands such holistic perspectives on data.
A biologist may not have the resources for locally storing and managing the large
amounts of biological data available on the Web (data sets like Uniprot are updated
monthly), nor may they be interested in locally managing data from other disciplines,
such as chemistry. To satisfy the needs of such applications, cloud data services are
growing in popularity, and while most are still in their nascent phases, significant
effort is underway to improve their usability and performance.
Many cloud data services are based on the MapReduce programming model [12]
or similar models, popularized by Google's exposition of its data-processing stack.
Its attractiveness lies in the simplicity of its programming model, which aids
usability, and in its ability to run on clusters of commodity-grade machines, which
keeps costs low. An open-source implementation of MapReduce called
Hadoop [9] is now available. Hadoop-based extensions such as Apache Pig [32] and
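To give a flavor of the model's simplicity, the sketch below shows a minimal Hadoop MapReduce job, written against the standard org.apache.hadoop.mapreduce API, that counts how often each property (predicate) occurs in an RDF data set. It is an illustrative example rather than code from this chapter: the class names are ours, and it assumes line-oriented N-Triples-style input in which subject, predicate, and object are whitespace-separated.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative example (not from this chapter): count occurrences of each RDF property.
public class PropertyCount {

  // Map phase: for each triple line "subject predicate object .", emit (predicate, 1).
  public static class PredicateMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text predicate = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\\s+");
      if (parts.length >= 3) {        // skip blank or malformed lines
        predicate.set(parts[1]);      // predicate is the second field
        context.write(predicate, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each predicate.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "rdf property count");
    job.setJarByClass(PropertyCount.class);
    job.setMapperClass(PredicateMapper.class);
    job.setCombinerClass(SumReducer.class);  // combine map output locally to cut shuffle volume
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The programmer supplies only the map and reduce functions; the framework handles partitioning the input, shuffling intermediate key-value pairs across the cluster, and recovering from machine failures, which is the source of the usability and low cost noted above.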