Introduction to Linked Data and Its Lifecycle on the Web - Reasoning Web - page 78

provenance, consistency, conciseness and relevancy as some could not be quantified

and were not perceived to be true quality indicators.

Automated. The LINK-QA framework [52] takes a set of resources, SPARQL endpoints

and/or dereferenceable resources, and a set of triples as input, automatically performs

quality assessment, and generates an HTML report. However, using this tool, the user

cannot choose the dataset that she is interested in assessing.

Manual. The WIQA [24] and Sieve [100] frameworks also assess the quality of datasets

but require a considerable amount of user involvement and are therefore considered

manual tools. For instance, WIQA provides the user a wide range of policies to filter

information and a set of quality criteria to assess the quality of the information.

Sieve, on the other hand, assists not only in the assessment of the quality of datasets

but also in their fusion. It aims to use the data integration task as a means to increase

completeness, conciseness and consistency in any chosen dataset. Sieve is a component

of the Linked Data Integration Framework (LDIF, see footnote 52), used first to assess

the quality of two or more data sources and second to fuse (integrate) the data from

those sources based on their quality assessment.

In order to use this tool, a user needs to be conversant with programming. The input

of Sieve is an LDIF provenance metadata graph generated from a data source. Based

on this information, the user needs to set the configuration properties in an XML file

known as integration properties. The quality assessment procedure relies on the

measurement of metrics chosen by the user, where each metric applies a scoring function

that returns a value between 0 and 1.
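
To make the idea of a scoring function concrete, here is a minimal Python sketch of a recency score in the spirit of TimeCloseness. It is not Sieve's implementation: the linear decay, the function name time_closeness, and the reading of the time span as a number of days are all assumptions made for illustration.

```python
from datetime import datetime

def time_closeness(last_updated, now, time_span_days):
    """Hypothetical recency score in [0, 1].

    Assumption: the score decays linearly from 1 (just updated) to 0
    (updated time_span_days ago or earlier). Sieve's actual formula may differ.
    """
    if time_span_days <= 0:
        raise ValueError("time_span_days must be positive")
    age_days = (now - last_updated).total_seconds() / 86400.0
    # Assumption: timestamps in the future are treated as fully fresh.
    age_days = max(age_days, 0.0)
    return max(0.0, 1.0 - age_days / time_span_days)

if __name__ == "__main__":
    now = datetime(2012, 1, 8)
    for stamp in (datetime(2012, 1, 8), datetime(2012, 1, 4, 12), datetime(2011, 12, 1)):
        print(stamp.isoformat(), time_closeness(stamp, now, 7))
```

Clamping the result keeps every metric on the same 0 to 1 scale, which is what allows scores from different metrics to be compared or combined later.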

Sieve implements only a few scoring functions, such as TimeCloseness,

Preference, SetMembership, Threshold and IntervalMembership, which are

calculated based on the metadata provided as input along with the original data source.

The configuration file is in XML format, which should be modified based on the use

case, as shown in Listing 1.1. The output scores are then used to fuse the data sources by

applying one of the fusion functions, which are: Filter, Average, Max, Min, First,

KeepSingleValueByQualityScore, Last, Random, PickMostFrequent.
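
The way quality scores drive fusion can be sketched in a few lines of Python. The functions below only borrow the names of four of the fusion functions listed above; their behaviour, their signatures, and the (value, score) pair representation are assumptions for illustration, not Sieve's API.

```python
from collections import Counter

# Each candidate is a (value, quality_score) pair: one conflicting value for
# the same property, together with the score of the source it came from.

def keep_single_value_by_quality_score(candidates):
    """Keep the value whose source has the highest quality score."""
    return max(candidates, key=lambda pair: pair[1])[0]

def filter_by_score(candidates, min_score):
    """Keep every value whose score reaches min_score (assumed semantics)."""
    return [value for value, score in candidates if score >= min_score]

def pick_most_frequent(candidates):
    """Keep the value that occurs most often, ignoring the scores."""
    return Counter(value for value, _ in candidates).most_common(1)[0][0]

def average(candidates):
    """Average numeric values, ignoring the scores."""
    values = [value for value, _ in candidates]
    return sum(values) / len(values)

if __name__ == "__main__":
    # Hypothetical conflicting values for one property from three sources.
    candidates = [(10, 0.9), (16, 0.4), (10, 0.2)]
    print(keep_single_value_by_quality_score(candidates))
    print(filter_by_score(candidates, 0.5))
    print(pick_most_frequent(candidates))
    print(average(candidates))
```

Only the first two functions consult the scores; the others resolve conflicts from the values alone, which is why the choice of fusion function matters as much as the choice of metric.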

Listing 1.1. A configuration of Sieve: a data quality assessment and data fusion tool

<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>
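
A configuration of this shape can be inspected with Python's standard xml.etree.ElementTree module, for example to check which metrics and scoring functions it declares before running an assessment. The sketch below is not part of Sieve or LDIF. It assumes a well-formed copy of the configuration, with the QualityAssessment element closed and the property spelled lastUpdated, and the helper name read_metrics is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed copy of the configuration from Listing 1.1.
CONFIG = """
<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>
"""

def read_metrics(xml_text):
    """Return one dict per AssessmentMetric found in the configuration."""
    root = ET.fromstring(xml_text.strip())
    metrics = []
    for metric in root.iter("AssessmentMetric"):
        function = metric.find("ScoringFunction")
        metrics.append({
            "id": metric.get("id"),
            "function": function.get("class"),
            "params": {p.get("name"): p.get("value") for p in function.findall("Param")},
            "inputs": [i.get("path") for i in function.findall("Input")],
        })
    return metrics

if __name__ == "__main__":
    for metric in read_metrics(CONFIG):
        print(metric["id"], metric["function"], metric["params"])
```

Because ET.fromstring raises a ParseError on malformed input, the same call doubles as a quick well-formedness check on a hand-edited configuration file.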

52 http://ldif.wbsg.de/
