Data Quality
Databases often deal with data coming from multiple sources of varying quality: data could be
incomplete, inconsistent, or riddled with measurement errors. To date, several research lines and
commercial solutions have been proposed to deal with these issues in order to improve data quality.
Data inconsistencies were initially studied by statisticians who needed to resolve discrepancies
arising from large statistical surveys. One of the first problems analyzed was the presence of
duplicate records for the same person (Elmagarmid et al., 2007; Naumann and Herschel, 2010); the
practical and theoretical solutions devised for it, collectively known as record linkage, make it
possible to find and link all the related data records, producing a unique and consistent view of
that person. It was quickly understood that record linkage was only one of a larger set of problems,
including wrong, missing, inaccurate, and contradictory data, and in the late 1980s researchers
started to investigate all the problems related to data quality. This effort was driven both by the
increasing number of scientific applications based on large numerical data sets and by the need to
integrate data from heterogeneous sources for business decision making.
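As a rough illustration of the record linkage idea (a minimal sketch, not one of the algorithms surveyed in the cited works), the fragment below links person records whose normalized names are sufficiently similar; the 'name' field and the 0.85 similarity threshold are hypothetical choices.

    from difflib import SequenceMatcher

    def normalize(name):
        # Crude normalization: lowercase, drop punctuation, collapse spaces.
        cleaned = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
        return " ".join(cleaned.split())

    def link_person_records(records, threshold=0.85):
        # Group records believed to refer to the same person, producing
        # one group (a unique, consistent view) per linked set of records.
        groups = []
        for rec in records:
            for group in groups:
                ref = normalize(group[0]["name"])
                if SequenceMatcher(None, normalize(rec["name"]), ref).ratio() >= threshold:
                    group.append(rec)
                    break
            else:
                groups.append([rec])
        return groups

    people = [
        {"name": "John A. Smith", "city": "Rome"},
        {"name": "john smith",    "city": "Rome"},
        {"name": "Mary Jones",    "city": "Paris"},
    ]
    print([len(g) for g in link_person_records(people)])   # [2, 1]

On this toy input the first two records are linked into a single group, while the third remains separate.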
The problem of missing data was initially studied in the context of scientific/numerical data
sets, relying on curative methods and algorithms able to recover or normalize missing or wrong
scientific/numerical values. More recently, the focus has moved to non-numerical data as well, in
order to improve exploratory queries, data integration, and the management of inherently low-quality
data sets, such as those produced by information extraction from the web or by sensor network
applications. In addition, research activities are attempting to build general-purpose uncertain data
management systems (e.g., Boulos et al., 2005).
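As a small, self-contained sketch of such a curative method (plain linear interpolation, not any specific system from the literature), the following fills missing numeric readings from their nearest known neighbors:

    def interpolate_missing(values):
        # Fill None entries by linear interpolation between the nearest
        # known neighbors; values without a neighbor on one side stay None.
        filled = list(values)
        known = [i for i, v in enumerate(filled) if v is not None]
        for i, v in enumerate(filled):
            if v is not None:
                continue
            left = max((k for k in known if k < i), default=None)
            right = min((k for k in known if k > i), default=None)
            if left is not None and right is not None:
                frac = (i - left) / (right - left)
                filled[i] = filled[left] + frac * (filled[right] - filled[left])
        return filled

    readings = [10.0, None, None, 16.0, 18.0, None]
    print(interpolate_missing(readings))   # [10.0, 12.0, 14.0, 16.0, 18.0, None]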
With complementary goals, and in order to ensure overall data quality, the so-called
"data cleaning" activity has been investigated, i.e., the process of standardizing data representation
and eliminating a wider range of errors. Data cleaning activities include record matching and
deduplication (i.e., recognizing that two or more retrieved data elements correspond to the same
real-world entity), data standardization (i.e., adjusting data formats and units), and data profiling
(i.e., evaluating data quality by gathering the aggregate statistics that constitute the data profile
and checking that the values match expectations).
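A minimal data-profiling pass along these lines might compute, for each column, the number of missing values, the number of distinct values, and the value range, and then check them against expectations; the column names and the expected maximum age below are made up for illustration.

    def profile(rows, columns):
        # Gather simple aggregate statistics (the "data profile") per column.
        stats = {}
        for col in columns:
            values = [row.get(col) for row in rows]
            present = [v for v in values if v is not None]
            stats[col] = {
                "nulls": len(values) - len(present),
                "distinct": len(set(present)),
                "min": min(present) if present else None,
                "max": max(present) if present else None,
            }
        return stats

    rows = [
        {"age": 34,   "country": "IT"},
        {"age": None, "country": "it"},   # missing value, inconsistent representation
        {"age": 290,  "country": "FR"},   # value outside the plausible range
    ]
    report = profile(rows, ["age", "country"])
    print(report["age"])   # {'nulls': 1, 'distinct': 2, 'min': 34, 'max': 290}
    if report["age"]["max"] is not None and report["age"]["max"] > 130:
        print("age exceeds the expected maximum")   # expectation check flags the bad value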
However, Infovis applications look at these issues in a different way, and the straightforward
adoption of the solutions proposed in the database field may either be a valid choice or become
an obstacle to the analysis process. As an example, assume that we are dealing with a missing or
erroneous value. Database techniques prescribe curative algorithms that provide an alternative
(e.g., interpolated or statistically computed) value for the bad data, but this solution can hide an
insight: the very fact that data are missing may be an insight in itself, e.g., it may reveal a faulty
sensor or a person omitting a form field to evade a tax.
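A visualization-oriented alternative, sketched below on hypothetical sensor readings, keeps the gap visible instead of repairing it: missing values are passed to the view explicitly flagged, rather than silently replaced by interpolated ones.

    def prepare_for_plot(timestamps, readings):
        # Return (time, value, is_missing) triples so the view can render a gap
        # or a distinct marker where data are absent, letting the analyst notice
        # a faulty sensor or an omitted field instead of a fabricated value.
        return [(t, v, v is None) for t, v in zip(timestamps, readings)]

    timestamps = [0, 1, 2, 3]
    readings = [21.5, None, None, 22.0]   # two missing sensor readings
    for t, v, missing in prepare_for_plot(timestamps, readings):
        print(t, "MISSING" if missing else v)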
Multidimensional Data
Very often an Infovis application deals with high-dimensional data that are hard to present to the end
user. Available techniques rely either on intrinsically high-dimensional visualizations, e.g., parallel