Discovery of Domain Values for Data Quality Assurance - Developing Concepts in Applied Intelligence


functional, approximate and conditional dependencies [5,6,7], and by data warehousing literature [8,9,10]. Commercial tools that support profiling activities (e.g. Informatica Data Quality, Informatica Data Explorer) do not offer domain discovery capabilities; among ETL tools, domain discovery options are available in Oracle Warehouse Builder (OWB). To our knowledge, apart from our own work [3], there has been no research concentrated specifically on the discovery of domain constraints.

2 DOMAIN Method

In [3] we presented DOMAIN (DOmain Mining and repAIr in uNclean data), a method for the discovery of domain values from textual data sets heavily affected by various data quality issues. In the remainder of this section we briefly present the concepts and the pseudocode of DOMAIN. For details, the reader should refer to [3].

2.1 Fundamental Concepts

Let $\bar{r}$ be a relation in scheme $R = \{A_1, A_2, \ldots, A_n\}$ representing a certain collection of UoD (universe of discourse) objects in such a way that there are no objects without an information system representation and each legitimate state of each object is represented unambiguously by a value from a domain $\bar{D}_i$ for attribute $A_i$ from $R$. We will refer to $\bar{r}$ as the ideal relation, having the highest possible ontological quality of data. Let $r$ be another relation in $R$ such that $\forall A_i \in R : \mathrm{dom}(A_i) \supset \bar{D}_i$. We assume that the set of values for each attribute in $r$ is defined as $D_i = \bar{D}_i \cup E_i \cup N_i$, where $E_i$ denotes the set of values that are 'damaged' versions of correct domain values, introduced by the imperfection of entry methods or other factors, and $N_i$ is the set of noise, that is, meaningless and random values. In the remainder of this paper we will focus on a single attribute $A$ and skip the subscript $i$. We will be considering a multiset $S_D = (D, m)$, where $m : D \to \mathbb{N}$ is a multiplicity function such that $m(d) = |\{t \in r : t(A) = d\}|$.

Let $\rightsquigarrow$ be a similarity relation on $S_D$ such that $d_k \rightsquigarrow d_l$ iff $(sim_{kl} \geq \varepsilon \wedge ratio_{kl} \geq \alpha)$, where $sim_{kl}$ is the Jaro-Winkler string similarity measure [12], equal to 0 for two completely different strings and equal to 1 for two identical strings, $\varepsilon$ is a textual similarity threshold, $ratio_{kl} = \max\left(\frac{m(d_k)}{m(d_l)}, \frac{m(d_l)}{m(d_k)}\right)$ is the multiplicity ratio, and $\alpha$ is the multiplicity ratio threshold. Informally, element $d_k$ is similar to element $d_l$ in terms of the relation $\rightsquigarrow$ if they are textually similar and element $d_l$ is significantly more frequent in the relation $r$ than $d_k$. The set $D$ and its subsets have the following features:

$\forall e \in E \; \exists d \in \bar{D} : e \rightsquigarrow d$ and $\forall n \in N \; \nexists d \in \bar{D} : n \rightsquigarrow d$.
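The multiplicity function and the similarity test defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `difflib.SequenceMatcher` stands in for the Jaro-Winkler measure (both return values in the range 0 to 1, but they are different measures), the thresholds `eps=0.8` and `alpha=2.0` are arbitrary, the sample column is made up, and the directional check (the target value must be the more frequent one) follows the informal description rather than any formula confirmed by the text.

```python
from collections import Counter
from difflib import SequenceMatcher


def text_similarity(a, b):
    # Stand-in for Jaro-Winkler (assumption): 0.0 for strings with nothing
    # in common, 1.0 for identical strings.
    return SequenceMatcher(None, a, b).ratio()


def multiplicities(column):
    # m(d) = number of tuples in r whose attribute A equals d.
    return Counter(column)


def is_similar(d_k, d_l, m, eps=0.8, alpha=2.0, sim=text_similarity):
    # Multiplicity ratio: max(m(d_k)/m(d_l), m(d_l)/m(d_k)).
    ratio = max(m[d_k] / m[d_l], m[d_l] / m[d_k])
    # Directional reading (assumption): d_l must be the more frequent value.
    more_frequent = m[d_l] > m[d_k]
    return more_frequent and sim(d_k, d_l) >= eps and ratio >= alpha


# Hypothetical attribute column: two domain values, one typo, one noise value.
column = ["Warsaw"] * 10 + ["Warsav"] * 2 + ["Krakow"] * 8 + ["xq7"]
m = multiplicities(column)
```

With these sample values, `is_similar("Warsav", "Warsaw", m)` holds, while the reverse direction and the noise value `"xq7"` do not satisfy the relation. Passing a real Jaro-Winkler implementation as `sim` would bring the sketch closer to the definition in the text.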

The multiset $S_D$ and the similarity relation may be represented as a directed weighted graph, where the nodes of the graph represent the elements of $S_D$ and the arcs of the graph represent the similarity relation. In this graph we can distinguish two classes of nodes: sinks, that is, nodes having only incoming arcs, and isolated nodes, having neither incoming nor outgoing arcs.
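The two node classes can be identified directly from the arc list. The sketch below assumes arcs are given as hypothetical `(source, target, weight)` triples pointing from the less frequent value to the more frequent one; the excerpt does not say what the arc weights are, so the weight is carried along but not interpreted.

```python
def classify_nodes(nodes, arcs):
    """Return (sinks, isolated) for a directed graph.

    nodes: iterable of node labels.
    arcs:  iterable of (source, target, weight) triples (assumed format).
    """
    arcs = list(arcs)  # allow generators; we iterate twice
    has_outgoing = {src for src, _, _ in arcs}
    has_incoming = {dst for _, dst, _ in arcs}

    # Sinks: at least one incoming arc and no outgoing arcs.
    sinks = {n for n in nodes if n in has_incoming and n not in has_outgoing}
    # Isolated nodes: neither incoming nor outgoing arcs.
    isolated = {n for n in nodes
                if n not in has_incoming and n not in has_outgoing}
    return sinks, isolated


# Hypothetical example: one misspelling pointing at its likely correct form.
nodes = ["Warsaw", "Warsav", "Krakow", "xq7"]
arcs = [("Warsav", "Warsaw", 0.83)]
sinks, isolated = classify_nodes(nodes, arcs)
```

Here `sinks` contains `"Warsaw"` and `isolated` contains `"Krakow"` and `"xq7"`; the node `"Warsav"` has an outgoing arc and so falls into neither class.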
