Data Quality
Databases often deal with data coming from multiple sources of varying quality: data could be
incomplete, inconsistent, or riddled with measurement errors. To date, several research lines and
commercial solutions have been proposed to deal with these issues in order to improve data quality.
Data inconsistencies were initially studied by statisticians who needed to resolve discrepancies
arising from large statistical surveys. One of the first problems analyzed was the presence of
duplicate records for the same person (Elmagarmid et al., 2007; Naumann and Herschel, 2010); the
practical and theoretical solutions devised for it, collectively known as record linkage, make it
possible to find and link all the related data records, producing a unique and consistent view of
that person. It was quickly understood that record linkage was only one of a larger set of problems,
including wrong, missing, inaccurate, and contradictory data, and in the late 1980s researchers
started to investigate all the problems related to data quality. This effort was driven both by the
increasing number of scientific applications based on large numerical data sets and by the need to
integrate data from heterogeneous sources for business decision making.
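As a rough illustration of the record linkage idea (a minimal sketch, not one of the algorithms surveyed in the cited works), the fragment below links person records whose normalized names are sufficiently similar; the 'name' field and the 0.85 similarity threshold are hypothetical choices.

    from difflib import SequenceMatcher

    def normalize(name):
        # Crude normalization: lowercase, drop punctuation, collapse spaces.
        cleaned = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
        return " ".join(cleaned.split())

    def link_person_records(records, threshold=0.85):
        # Group records believed to refer to the same person, producing
        # one group (a unique, consistent view) per linked set of records.
        groups = []
        for rec in records:
            for group in groups:
                ref = normalize(group[0]["name"])
                if SequenceMatcher(None, normalize(rec["name"]), ref).ratio() >= threshold:
                    group.append(rec)
                    break
            else:
                groups.append([rec])
        return groups

    people = [
        {"name": "John A. Smith", "city": "Rome"},
        {"name": "john smith",    "city": "Rome"},
        {"name": "Mary Jones",    "city": "Paris"},
    ]
    print([len(g) for g in link_person_records(people)])   # [2, 1]

On this toy input the first two records are linked into a single group, while the third remains separate.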
The problem of missing data was initially studied in the context of scientific/numerical data
sets, relying on curative methods and algorithms able to recover or normalize missing or wrong
scientific/numerical values. More recently, the focus has moved to non-numerical data as well, in
order to improve exploratory queries, data integration, and the management of inherently low-quality
data sets, such as those produced by information extraction from the web or by sensor network
applications. In addition, research activities are attempting to build general-purpose uncertain data
management systems (e.g., Boulos et al., 2005).
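As a small, self-contained sketch of such a curative method (plain linear interpolation, not any specific system from the literature), the following fills missing numeric readings from their nearest known neighbors:

    def interpolate_missing(values):
        # Fill None entries by linear interpolation between the nearest
        # known neighbors; values without a neighbor on one side stay None.
        filled = list(values)
        known = [i for i, v in enumerate(filled) if v is not None]
        for i, v in enumerate(filled):
            if v is not None:
                continue
            left = max((k for k in known if k < i), default=None)
            right = min((k for k in known if k > i), default=None)
            if left is not None and right is not None:
                frac = (i - left) / (right - left)
                filled[i] = filled[left] + frac * (filled[right] - filled[left])
        return filled

    readings = [10.0, None, None, 16.0, 18.0, None]
    print(interpolate_missing(readings))   # [10.0, 12.0, 14.0, 16.0, 18.0, None]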
With complementary goals, and in order to ensure overall data quality, the so-called
"data cleaning" activity has been investigated, i.e., the process of standardizing data representation
and eliminating a wider range of errors. Data cleaning activities include record matching and
deduplication (i.e., recognizing that two or more retrieved data elements correspond to the same
real-world entity), data standardization (i.e., adjusting data formats and units), and data profiling
(i.e., evaluating data quality by gathering the aggregate statistics that constitute the data profile
and checking that the values match expectations).
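A minimal data-profiling pass along these lines might compute, for each column, the number of missing values, the number of distinct values, and the value range, and then check them against expectations; the column names and the expected maximum age below are made up for illustration.

    def profile(rows, columns):
        # Gather simple aggregate statistics (the "data profile") per column.
        stats = {}
        for col in columns:
            values = [row.get(col) for row in rows]
            present = [v for v in values if v is not None]
            stats[col] = {
                "nulls": len(values) - len(present),
                "distinct": len(set(present)),
                "min": min(present) if present else None,
                "max": max(present) if present else None,
            }
        return stats

    rows = [
        {"age": 34,   "country": "IT"},
        {"age": None, "country": "it"},   # missing value, inconsistent representation
        {"age": 290,  "country": "FR"},   # value outside the plausible range
    ]
    report = profile(rows, ["age", "country"])
    print(report["age"])   # {'nulls': 1, 'distinct': 2, 'min': 34, 'max': 290}
    if report["age"]["max"] is not None and report["age"]["max"] > 130:
        print("age exceeds the expected maximum")   # expectation check flags the bad value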
However, Infovis applications look at these issues in a different way, and the straightforward
adoption of the solutions proposed in the database field may either be a valid choice or become
an obstacle to the analysis process. As an example, assume that we are dealing with a missing or
erroneous value. Database techniques prescribe curative algorithms that provide an alternative
(e.g., interpolated or statistically computed) value for the bad data, but this solution can hide an
insight: the very fact that data are missing may be an insight in itself, e.g., it may reveal a faulty
sensor or a person omitting a form field to evade a tax.
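A visualization-oriented alternative, sketched below on hypothetical sensor readings, keeps the gap visible instead of repairing it: missing values are passed to the view explicitly flagged, rather than silently replaced by interpolated ones.

    def prepare_for_plot(timestamps, readings):
        # Return (time, value, is_missing) triples so the view can render a gap
        # or a distinct marker where data are absent, letting the analyst notice
        # a faulty sensor or an omitted field instead of a fabricated value.
        return [(t, v, v is None) for t, v in zip(timestamps, readings)]

    timestamps = [0, 1, 2, 3]
    readings = [21.5, None, None, 22.0]   # two missing sensor readings
    for t, v, missing in prepare_for_plot(timestamps, readings):
        print(t, "MISSING" if missing else v)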
Multidimensional Data
Very often an Infovis application deals with high-dimensional data that are hard to present to the end
user. Available techniques rely either on intrinsically high-dimensional visualizations, e.g., parallel