In addition, users should specify the DataSource folder, the homepage element that
refers to the data source from which the entities are going to be fused. Second, the XML
file of the ImportJobs that downloads the data to the server should also be modified.
In particular, the user should set up the dumpLocation element as the location of the
dump file.
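The dumpLocation change described above could be scripted rather than made by hand. The following is a minimal sketch only: the element name dumpLocation comes from the text, while the surrounding document layout, the sample data source name, and the dump path are hypothetical placeholders, and the real ImportJob schema may differ (for instance, it may use an XML namespace, which the helper tolerates).

```python
import xml.etree.ElementTree as ET

# Hypothetical import-job document. Only the dumpLocation element name is
# taken from the text; the rest of the layout is an assumption.
SAMPLE_JOB = """<importJob>
  <dataSource>ExampleSource</dataSource>
  <dumpLocation>PLACEHOLDER</dumpLocation>
</importJob>"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def set_dump_location(xml_text, location):
    """Return xml_text with every dumpLocation element set to `location`."""
    root = ET.fromstring(xml_text)
    matches = [el for el in root.iter() if local_name(el.tag) == "dumpLocation"]
    if not matches:
        raise ValueError("no dumpLocation element found")
    for el in matches:
        el.text = location
    return ET.tostring(root, encoding="unicode")


# Hypothetical dump path, used purely for illustration.
updated = set_dump_location(SAMPLE_JOB, "file:///data/dumps/example.nq")
print(updated)
```

Matching on the local element name keeps the sketch usable whether or not the configuration file declares a namespace.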
Although the tool is very useful overall, there are some drawbacks that decrease its
usability: (1) the tool is not mainly used to assess data quality for a source, but instead
to perform data fusion (integration) based on quality assessment. Therefore, the quality
assessment can be considered as an accessory that leverages the process of data fusion
through the evaluation of a few quality indicators; (2) it does not provide a user interface,
ultimately limiting its usage to end-users with programming skills; (3) its usage is limited
to domains providing provenance metadata associated with the data source.
LODGRefine 53 [152], a LOD-enabled version of Google Refine, is an open-source
tool for refining messy data. Although this tool is not focused on data quality assessment
per se, it is powerful in performing preliminary cleaning or refining of raw data.
Using this tool, one is able to import several different file types of data (CSV, Excel,
XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via
a browser-based interface. By using a diverse set of filters and facets on individual
columns, LODGRefine can help a user to semi-automate the cleaning of her data.
For example, this tool can help to detect duplicates, discover patterns (e.g. alternative
forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and
replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect
a dataset to existing vocabularies such that it gives meaning to the values. Reconciliation
to Freebase 54 helps map ambiguous textual values to precisely identified
Freebase entities. Reconciling using Sindice or based on standard SPARQL or SPARQL
with full-text search is also possible 55 using this tool. Moreover, it is also possible to
extend the reconciled data with DBpedia as well as export the data as RDF, which adds
to the uniformity and usability of the dataset.
These features thus assist in assessing as well as improving the data quality of a
dataset. Moreover, by providing external links, the interlinking of the dataset is considerably
improved. LODGRefine is easy to download and install as well as to upload and
perform basic cleansing steps on raw data. The features of reconciliation, extending the
data with DBpedia, transforming and exporting the data as RDF are added advantages.
However, this tool has a few drawbacks: (1) the user is not able to perform detailed
high-level data quality analysis utilizing the various quality dimensions using this tool;
(2) performing cleansing over a large dataset is time-consuming as the tool follows a
column data model and thus the user must perform transformations per column.
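The basic per-column cleaning steps mentioned above (spotting trailing white spaces, detecting duplicates, and replacing blank cells) can be illustrated outside the tool. The following is a minimal sketch in plain Python, not LODGRefine's own implementation or API; the function names, the sample column, and the "unknown" replacement value are all assumptions made for illustration.

```python
from collections import Counter


def values_with_stray_whitespace(values):
    """Return the string cells that carry leading or trailing white space."""
    return [v for v in values if isinstance(v, str) and v != v.strip()]


def trim_whitespace(values):
    """Strip surrounding white space from string cells; leave others as-is."""
    return [v.strip() if isinstance(v, str) else v for v in values]


def find_duplicates(values):
    """Map each non-blank value occurring more than once to its count."""
    counts = Counter(v for v in values if v not in (None, ""))
    return {v: n for v, n in counts.items() if n > 1}


def fill_blanks(values, replacement):
    """Replace blank cells (None or empty string) with `replacement`."""
    return [replacement if v is None or v == "" else v for v in values]


# Hypothetical single column of raw data.
column = ["Berlin", "Berlin ", "", "Leipzig", None, "Berlin"]

stray = values_with_stray_whitespace(column)
trimmed = trim_whitespace(column)
duplicates = find_duplicates(trimmed)
cleaned = fill_blanks(trimmed, "unknown")

print(stray)
print(duplicates)
print(cleaned)
```

Each helper operates on one column at a time, which mirrors the column data model noted in drawback (2): a dataset with many columns requires the same steps to be repeated for each of them.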
8 Outlook and Future Challenges
Although the different approaches for aspects of the Linked Data life-cycle as presented
in this chapter are already working together, more effort must be made to further
53 http://code.zemanta.com/sparkica/
54 http://www.freebase.com/
55 http://refine.deri.ie/reconciliationDocs