Discovery of Domain Values for Data Quality Assurance
Lukasz Ciszak
Institute of Computer Science, Warsaw University of Technology,
ul. Nowowiejska 15/19, 00-665 Warszawa, Poland
l.ciszak@ii.pw.edu.pl
Abstract. Data profiling is a crucial step in the data quality process because
it provides the current data quality rules. In this paper we present
experimental results comparing our DOMAIN method for the discovery of domain
constraint values with the commercially available Oracle Warehouse Builder
(OWB). The experimental results show that our approach is more effective than
OWB at discovering domain values for textual data affected by data quality
problems.
Keywords: information quality, data quality, data profiling, domain
discovery.
1 Introduction
Business intelligence systems built around a data warehouse require high-quality
data as input for the analytical processes used to draw conclusions about
the future of a given enterprise and to support the decision-making process. In
this paper we concentrate on an experimental comparison of the effectiveness of
our DOMAIN method for the discovery of domain constraint values (data quality
rules) with that of the Oracle Warehouse Builder profiler, which offers similar
functionality. The details of DOMAIN are given in [3].
Data quality in an information system is defined as 'the fitness of the data for
use' [10]; ontologically, high-quality data represent the Universe of Discourse
(UoD) in a way that allows us to draw conclusions about the future states
of the UoD [11]. Real-world data are always affected by various data quality
problems resulting from imperfect data input methods (misspellings,
mistypings), data transport (encoding problems), and the evolution of the data
source (stale and inaccurate metadata). Domain constraints ensure that only
states that represent valid states of the UoD are allowed in the information
system. The data quality assessment process discovers the data quality rules and
assesses the current quality of the data in the system. The results are
subsequently consumed by the data cleansing process, which improves the quality
of the data.
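
To make the role of a domain constraint concrete, the following is a minimal sketch of how a set of valid values could be used to assess a textual column. It is an illustration only and not the DOMAIN method or the OWB profiler; the function name `assess_domain_constraint`, the sample city values, and the choice of the fraction of valid values as a quality score are all assumptions made for this example.

```python
from collections import Counter


def assess_domain_constraint(values, domain):
    """Check a column against a domain constraint (hypothetical helper).

    Returns the fraction of values inside the domain and a Counter of the
    values that violate the constraint.
    """
    violations = Counter(v for v in values if v not in domain)
    valid_count = len(values) - sum(violations.values())
    # An empty column has no violations, so it is treated as fully valid.
    quality = valid_count / len(values) if values else 1.0
    return quality, violations


# Hypothetical column with one mistyped value and one encoding problem.
cities = ["Warszawa", "Warszawa", "Krakow", "Warszwa", "Krak?w", "Gdansk"]
domain = {"Warszawa", "Krakow", "Gdansk"}

quality, violations = assess_domain_constraint(cities, domain)
print(f"fraction of valid values: {quality:.2f}")
print(f"violating values: {dict(violations)}")
```

In this sketch the domain is given up front; a profiling method has the harder task of discovering that set from data that already contain such errors, and a cleansing step would then decide how to repair the violating values.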
Data quality improvement is studied in the areas of constraint repair
and consistent query answering [1,2] and of record linkage and duplicate
elimination [13]. The profiling methods are studied in terms of the discovery of
 