provenance, consistency, conciseness and relevancy as some could not be quantified
and were not perceived to be true quality indicators.
Automated. The LINK-QA framework [52] takes a set of resources, SPARQL endpoints and/or
dereferenceable resources, and a set of triples as input, automatically performs
quality assessment, and generates an HTML report. However, with this tool the user
cannot choose the dataset that she is interested in assessing.
Manual. The WIQA [24] and Sieve [100] frameworks also assess the quality of datasets
but require a considerable amount of user involvement and are therefore considered
manual tools. For instance, WIQA provides the user with a wide range of policies to
filter information and a set of quality criteria to assess the quality of the information.
Sieve, on the other hand, assists not only in the assessment of the quality of datasets
but also in their fusion. It aims to use the data integration task as a means to increase
completeness, conciseness and consistency in any chosen dataset. Sieve is a component
of the Linked Data Integration Framework (LDIF) (footnote 52), used first to assess the
quality of two or more data sources and second to fuse (integrate) the data from those
sources based on their quality assessment.
To use this tool, a user needs to be conversant with programming. The input
of Sieve is an LDIF provenance metadata graph generated from a data source. Based
on this information, the user needs to set the configuration properties in an XML file
known as integration properties. The quality assessment procedure relies on the
measurement of metrics chosen by the user, where each metric applies a scoring function
that yields a value between 0 and 1.
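To make the idea of a scoring function concrete, the following is a minimal sketch of a recency-style score that maps a last-update timestamp to a value between 0 and 1. The linear decay, the function name `time_closeness` and its parameters are assumptions made for illustration; they are not taken from Sieve's implementation.

```python
from datetime import datetime


def time_closeness(last_updated: datetime, now: datetime, time_span_days: float) -> float:
    """Hypothetical recency score in [0, 1].

    Returns 1.0 for data updated at `now` and decays linearly to 0.0
    once the age of the data reaches `time_span_days` (assumed behaviour).
    """
    if time_span_days <= 0:
        raise ValueError("time_span_days must be positive")
    age_days = (now - last_updated).total_seconds() / 86400.0
    if age_days < 0:
        # Assumption: timestamps in the future count as maximally recent.
        return 1.0
    return max(0.0, 1.0 - age_days / time_span_days)


# Toy provenance metadata: one last-update timestamp per named graph.
now = datetime(2024, 1, 8)
last_updates = {
    "graph_a": datetime(2024, 1, 8),
    "graph_b": datetime(2024, 1, 4, 12),
    "graph_c": datetime(2023, 12, 1),
}
scores = {g: time_closeness(ts, now, time_span_days=7) for g, ts in last_updates.items()}
```

With a time span of 7 days, a graph updated at the reference time scores highest, one updated halfway through the window scores in the middle, and one older than the window scores 0.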
Sieve implements only a few scoring functions, such as TimeCloseness,
Preference, SetMembership, Threshold and IntervalMembership, which are
calculated from the metadata provided as input along with the original data source.
The configuration file is in XML format and should be adapted to the use
case, as shown in Listing 1.1. The output scores are then used to fuse the data sources by
applying one of the fusion functions: Filter, Average, Max, Min, First,
KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.
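How quality scores can drive fusion may be easier to see in code. The sketch below gives plausible readings of three of the fusion functions named above, operating on conflicting values for one property, each paired with the quality score of its source. The semantics are inferred from the function names alone and are assumptions, as are the Python function names and the toy data.

```python
from collections import Counter
from statistics import mean


def keep_single_value_by_quality_score(candidates):
    """Return the value whose source has the highest quality score."""
    return max(candidates, key=lambda pair: pair[1])[0]


def pick_most_frequent(candidates):
    """Return the value reported by the largest number of sources."""
    counts = Counter(value for value, _ in candidates)
    return counts.most_common(1)[0][0]


def average(candidates):
    """Return the arithmetic mean of all reported numeric values."""
    return mean(value for value, _ in candidates)


# Toy input: (value, quality score of the source) for a single property.
candidates = [(100, 0.9), (120, 0.4), (120, 0.2)]

best_by_score = keep_single_value_by_quality_score(candidates)
most_frequent = pick_most_frequent(candidates)
averaged = average(candidates)
```

The three strategies can disagree on the same input: here the highest-scored source reports 100, while the most frequently reported value is 120. That is why the choice of fusion function is left to the user's configuration.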
Listing 1.1. A configuration of Sieve: a data quality assessment and data fusion tool
<Sieve>
  <QualityAssessment>
    <!-- Recency metric: scores a graph by how recently it was updated -->
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <!-- Reputation metric: scores a graph by its position in a priority list -->
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>
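As an illustration of how such a configuration is structured, the following sketch reads the assessment metrics out of a well-formed copy of the configuration (with the QualityAssessment element closed) using Python's standard `xml.etree.ElementTree` module. This is not how Sieve itself loads its configuration; the helper `read_metrics` and the dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the configuration, adapted from Listing 1.1.
CONFIG = """
<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>
"""


def read_metrics(xml_text):
    """Return one dict per AssessmentMetric: id, scoring function, params, inputs."""
    root = ET.fromstring(xml_text.strip())
    metrics = []
    for metric in root.iter("AssessmentMetric"):
        function = metric.find("ScoringFunction")
        metrics.append({
            "id": metric.get("id"),
            "function": function.get("class"),
            "params": {p.get("name"): p.get("value") for p in function.findall("Param")},
            "inputs": [i.get("path") for i in function.findall("Input")],
        })
    return metrics


metrics = read_metrics(CONFIG)
for m in metrics:
    print(m["id"], m["function"], m["params"], m["inputs"])
```

Each metric pairs an identifier with one scoring function, its parameters and the metadata path it reads. That is the information a user has to supply when adapting the file to a use case.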
52 http://ldif.wbsg.de/
 