6.7.2 Scalability with Increasing Joins
6.7.3 Scalability with Increasing Cluster Size
6.8 Intraquery Scan Sharing for NTGA Execution Plans
6.8.1 Scan-Sharing Strategies for Efficient Processing of Repeated Properties
6.8.1.1 Classification of Triplegroups
6.8.1.2 Implementation of Clone Operation
6.8.2 Case Study: Impact of Scan-Sharing for Graph Pattern Queries with Repeated Properties
6.8.2.1 Setup and Testbed
6.8.2.2 Varying Number of Repeated Properties across a Query
6.8.2.3 Varying Size of RDF Graphs
6.9 Nesting-Aware Physical Operators to Minimize Data Transfer Costs in NTGA Execution Plans
6.9.1 Unnesting Strategies for Efficient Management of Multivalued Properties
6.9.2 Case Study: Impact of Nesting and Lazy Unnesting Strategies for Graph Pattern Queries with Multivalued Properties
6.10 Concluding Remarks
References
6.1 INTRODUCTION
The growing success of the Semantic Web and Web of data initiatives has ushered in
the era of “Big Semantic Web Data.” Data sets such as the Billion Triple Challenge
[2] are on the order of billions of triples, and scientific data collections like the Open
Science Data Cloud [3] are approaching petabyte scale. A crucial question now is
how to meet the scalability challenges of processing such data collections. Further,
emerging applications are introducing nontraditional scalability requirements, where
needs are elastic and vary significantly over time. For example,
a biologist may want to analyze their protein data by linking it to other publicly
available related data. This data may come from their own domain or from other
domains, for example, data about chemical compounds; interdisciplinary research
increasingly demands such holistic perspectives on data.
A biologist may not have the resources for locally storing and managing the large
amounts of biological data available on the Web (data sets like Uniprot are updated
monthly), nor may they be interested in locally managing data from other disciplines,
such as chemistry. To satisfy the needs of such applications, cloud data services are
growing in popularity, and while most are still in their nascent phases, significant
effort is underway to improve their usability and performance.
Many cloud data services are based on the MapReduce programming model [12]
or similar models, popularized by Google's exposition of its data-processing stack.
Its attractiveness lies in the simplicity of its programming model, which aids
usability, and in its ability to run on clusters of commodity-grade machines, which
keeps costs low. An open-source implementation of MapReduce called
Hadoop [9] is now available. Hadoop-based extensions such as Apache Pig [32] and
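To give a flavor of the model's simplicity, the sketch below shows a minimal Hadoop MapReduce job, written against the standard org.apache.hadoop.mapreduce API, that counts how often each property (predicate) occurs in an RDF data set. It is an illustrative example rather than code from this chapter: the class names are ours, and it assumes line-oriented N-Triples-style input in which subject, predicate, and object are whitespace-separated.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative example (not from this chapter): count occurrences of each RDF property.
public class PropertyCount {

  // Map phase: for each triple line "subject predicate object .", emit (predicate, 1).
  public static class PredicateMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text predicate = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\\s+");
      if (parts.length >= 3) {        // skip blank or malformed lines
        predicate.set(parts[1]);      // predicate is the second field
        context.write(predicate, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each predicate.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "rdf property count");
    job.setJarByClass(PropertyCount.class);
    job.setMapperClass(PredicateMapper.class);
    job.setCombinerClass(SumReducer.class);  // combine map output locally to cut shuffle volume
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The programmer supplies only the map and reduce functions; the framework handles partitioning the input, shuffling intermediate key-value pairs across the cluster, and recovering from machine failures, which is the source of the usability and low cost noted above.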