Discovery of Domain Values for Data Quality Assurance - Developing Concepts in Applied Intelligence


Discovery of Domain Values for Data Quality Assurance

Lukasz Ciszak

Institute of Computer Science, Warsaw University of Technology,

ul. Nowowiejska 15/19, 00-665 Warszawa, Poland

l.ciszak@ii.pw.edu.pl

Abstract. Data profiling is a crucial step in the data quality process, as it
provides the current data quality rules. In this paper we present experimental
results comparing our DOMAIN method for the discovery of domain constraint
values with the commercially available Oracle Warehouse Builder (OWB). The
experimental results show that our approach is more effective than OWB at
discovering domain values for textual data affected by data quality problems.

Keywords: information quality, data quality, data profiling, domain discovery.

1 Introduction

Business intelligence systems built around a data warehouse require high-quality
data as the input for the analytical processes used to draw conclusions about
the future of a given enterprise and to support the decision-making process. In
this paper we concentrate on an experimental comparison of the effectiveness of
our DOMAIN method for the discovery of domain constraint values (data quality
rules) with that of the Oracle Warehouse Builder profiler, which offers similar
functionality. The details of DOMAIN are covered in [3].

Data quality in an information system is defined as 'the fitness of the data for
use' [10]; ontologically, high-quality data represent the Universe of Discourse
(UoD) in a way that allows us to draw conclusions about the future states of the
UoD [11]. Real-world data are always affected by various data quality problems
resulting from the imperfection of data input methods (misspellings,
mistypings), data transport (encoding problems), and finally the evolution of
the data source (stale and inaccurate metadata). Domain constraints ensure that
only states that represent valid states of the UoD are allowed in the
information system. The data quality assessment process discovers the data
quality rules and assesses the current quality of the data in the system. The
results are subsequently consumed by the data cleansing process, which improves
the quality of the data.
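
To make the notions of a domain constraint and of quality assessment concrete, the following minimal sketch may help. It is not the DOMAIN method of [3] and not the OWB profiler; it assumes a deliberately naive frequency threshold, and the names discover_domain, assess, and min_share, as well as the sample city values, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def discover_domain(values, min_share=0.05):
    """Naive rule discovery (an illustrative assumption, not DOMAIN):
    treat every value whose relative frequency is at least min_share
    as a valid domain value."""
    if not values:
        return set()
    counts = Counter(values)
    total = len(values)
    return {value for value, count in counts.items() if count / total >= min_share}


def assess(values, domain):
    """Return the share of values that satisfy the domain constraint,
    together with the list of violating values."""
    violations = [value for value in values if value not in domain]
    quality = 1 - len(violations) / len(values)
    return quality, violations


# Hypothetical column with two valid city names and two mistyped entries.
cities = ["Warszawa"] * 8 + ["Krakow"] * 6 + ["Warszwa", "Krakw"]

domain = discover_domain(cities, min_share=0.2)
quality, violations = assess(cities, domain)
```

Here the two frequent spellings form the discovered domain, and the two rare, mistyped entries are reported as violations that a subsequent cleansing step could correct. A pure frequency threshold would fail when errors are common or valid values are rare, which is one reason dedicated discovery methods are needed.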

Data quality improvement is researched in the areas of constraint repair and
consistent query answering [1,2] and of record linkage and duplicate
elimination [13]. The profiling methods are studied in terms of the discovery of
