functional, approximate and conditional dependencies [5,6,7], and by data
warehousing literature [8,9,10]. Commercial tools for the support of profiling
activities (e.g. Informatica Data Quality, Informatica Data Explorer) do not
offer domain discovery capabilities; in the ETL tool area, domain discovery
options are available in the Oracle Warehouse Builder (OWB). To our knowledge,
apart from ours [3], there has been no research specifically concentrated on
the discovery of domain constraints.
2 DOMAIN Method
In [3] we presented the DOMAIN (DOmain Mining and repAIr in uNclean data)
method for the discovery of domain values from textual data sets heavily affected
by various data quality issues. In the remainder of this section we briefly
present the concepts and the pseudocode of DOMAIN. For the details the reader
should refer to [3].
2.1 Fundamental Concepts
Let $\bar{r}$ be a relation in scheme $R = \{A_1, A_2, \ldots, A_n\}$
representing a certain collection of UoD objects in such a way that there are no
objects without an information system representation and each legitimate state
of each object is represented unambiguously by a value from the domain of an
attribute from $R$. We will refer to $\bar{r}$ as the ideal relation, having the
highest possible ontological quality of data. Let $r$ be another relation in $R$
such that $\forall A_i \in R:\ dom(A_i) \supseteq \bar{D}_i$, where $\bar{D}_i$
is the set of correct domain values of $A_i$.
We assume that the set of values for each attribute in $r$ is defined as
$D_i = \bar{D}_i \cup E_i \cup N_i$, where $E_i$ denotes the set of values that
are 'damaged' versions of correct domain values, introduced by the imperfection
of entry methods or other factors, and $N_i$ is the set of noise, that is,
meaningless and random values. In the remainder of this paper we will focus on a
single attribute $A$ and skip the subscript $i$. We will be considering a
multiset $S_D = (D, m)$, where $m : D \to \mathbb{N}$ is a multiplicity function
such that $m(d) = |\{t \in r : t(A) = d\}|$.
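The multiset $S_D = (D, m)$ can be illustrated with a minimal Python sketch. The column contents and the names `multiplicity`, `column`, `m` and `D` are hypothetical and serve only as an example; they are not part of the method as published in [3].

```python
from collections import Counter


def multiplicity(values):
    """Return m: D -> N, i.e. how many tuples of r carry each value of A."""
    return Counter(values)


# Hypothetical values of a single textual attribute A in relation r.
column = ["Warsaw", "Warsaw", "Warsaw", "Warsaw", "Warsw", "Wasraw", "xq#1"]

m = multiplicity(column)  # multiplicity function m(d)
D = set(m)                # set of distinct values observed for A

for d in sorted(D):
    print(d, m[d])
```

In this toy column "Warsaw" would play the role of a correct domain value, "Warsw" and "Wasraw" of damaged values from $E$, and "xq#1" of noise from $N$.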
Let $\prec$ be a similarity relation on $S_D$ such that
$d_k \prec d_l$ iff $(sim_{kl} \geq \varepsilon \wedge ratio_{kl} \geq \alpha)$,
where $sim_{kl}$ is the Jaro-Winkler string similarity measure [12], equal to 0
for two completely different strings and equal to 1 for two identical strings,
$\varepsilon$ is a textual similarity threshold,
$ratio_{kl} = \max\left(\frac{m(d_k)}{m(d_l)}, \frac{m(d_l)}{m(d_k)}\right)$
is the multiplicity ratio, and $\alpha$ is the multiplicity ratio threshold.
Informally, element $d_k$ is similar to element $d_l$ in terms of the relation
$\prec$ if they are textually similar and element $d_l$ is significantly more
frequent in the relation $r$ than $d_k$. The set $D$ and its subsets have the
following features:
$\forall e \in E\ \exists \bar{d} \in \bar{D} : e \prec \bar{d}$ and
$\forall n \in N\ \neg\exists \bar{d} \in \bar{D} : n \prec \bar{d}$.
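The similarity test defined above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pseudocode of [3]: the Jaro-Winkler measure is written out with the commonly used prefix scale 0.1 and prefix length 4, the direction of the relation (from the less frequent to the more frequent value) follows the informal description, and the threshold values and counts are hypothetical.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scale=0.1, max_prefix=4):
    """Jaro-Winkler similarity: Jaro boosted by the length of the common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * scale * (1.0 - j)


def is_similar(dk, dl, m, eps, alpha):
    """True if dk is similar to dl: textually close and dl clearly more frequent.

    The check m[dl] > m[dk] encodes the direction as an assumption; the
    multiplicity ratio itself is symmetric in dk and dl.
    """
    ratio = max(m[dk] / m[dl], m[dl] / m[dk])
    return m[dl] > m[dk] and jaro_winkler(dk, dl) >= eps and ratio >= alpha


# Hypothetical multiplicities and thresholds.
m = {"Warsaw": 40, "Warsw": 2, "xq#1": 1}
print(is_similar("Warsw", "Warsaw", m, eps=0.85, alpha=5))
print(is_similar("xq#1", "Warsaw", m, eps=0.85, alpha=5))
```

Passing the thresholds as parameters keeps the sketch independent of any particular choice of $\varepsilon$ and $\alpha$.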
The multiset $S_D$ and the similarity relation $\prec$ may be represented as a
directed weighted graph, where the nodes of the graph represent the elements
from $S_D$ and the arcs of the graph represent the relation $\prec$. In this
graph we can distinguish two classes of nodes: sinks, that is, nodes having only
incoming arcs, and isolated nodes, having neither incoming nor outgoing arcs.
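The two node classes can be identified from in-degrees and out-degrees alone, as the following minimal Python sketch shows. The function name `classify_nodes`, the example nodes and the arc weights are hypothetical; in particular, this excerpt does not state what the arc weights denote, so the numbers below are placeholders.

```python
def classify_nodes(nodes, arcs):
    """Split graph nodes into sinks and isolated nodes.

    nodes: iterable of values from D
    arcs:  dict mapping (source, target) pairs to a weight; an arc
           (dk, dl) means that dk is similar to dl
    """
    in_deg = {d: 0 for d in nodes}
    out_deg = {d: 0 for d in nodes}
    for src, dst in arcs:
        out_deg[src] += 1
        in_deg[dst] += 1
    # Sinks have at least one incoming arc and no outgoing arcs.
    sinks = {d for d in nodes if in_deg[d] > 0 and out_deg[d] == 0}
    # Isolated nodes have no arcs at all.
    isolated = {d for d in nodes if in_deg[d] == 0 and out_deg[d] == 0}
    return sinks, isolated


# Hypothetical graph: two damaged values point to one frequent value.
nodes = {"Warsaw", "Warsw", "Wasraw", "xq#1"}
arcs = {("Warsw", "Warsaw"): 0.97, ("Wasraw", "Warsaw"): 0.95}

sinks, isolated = classify_nodes(nodes, arcs)
print(sinks, isolated)
```

In this toy graph "Warsaw" is the only sink and "xq#1" is the only isolated node.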
 