3 Experimental Results
We performed the experiments on two data sets: a synthetic 'American cities'
data set and a real-life 'Polish cities' data set. The latter consists of the
names of Polish cities extracted from an Internet survey. It contains 2199
distinct values (28429 values in total), of which 1862 were domain values
(valid Polish city names), covering 27279 elements of the data set. The
synthetic 'American cities' data set was created using the names of 50 of the
largest American cities as the domain values. Incorrect variants of these
values were generated according to the distribution of textual errors observed
in the 'Polish cities' data set. The resulting data set contained 1350 distinct
values (24749 values in total), and the domain values covered 22516 values of
the data set. The purpose of the experiments was to compare the domain
discovery effectiveness of DOMAIN with that of the Oracle Warehouse Builder
(OWB) profiler.
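The counts above imply how much of each data set lies outside the domain. The following minimal sketch only redoes that arithmetic; the dictionary keys and the function name `non_domain_share` are illustrative choices, not notation from the paper.

```python
# Totals and domain-covered counts are taken from the data set description.
datasets = {
    "Polish cities": {"total": 28429, "covered_by_domain": 27279},
    "American cities": {"total": 24749, "covered_by_domain": 22516},
}


def non_domain_share(total, covered_by_domain):
    """Fraction of all values (with repetitions) that are not domain values."""
    return (total - covered_by_domain) / total


for name, counts in datasets.items():
    share = non_domain_share(**counts)
    print(f"{name}: {100 * share:.2f}% of the values are non-domain")
# Polish cities: 4.05% of the values are non-domain
# American cities: 9.02% of the values are non-domain
```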
The effectiveness of domain discovery was assessed using two measures. The
first measure, p_D, is the ratio of the discovered domain values to all the
domain values in the given data set; the second measure, p_nD, is the ratio of
the non-domain values classified as domain values to all the non-domain values
in the given data set. Both measures are also expressed in an absolute form,
which takes into account the number of occurrences of each value within the
data set; these forms are denoted P_D and P_nD, respectively.
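\[
p_D = \frac{|D' \cap D|}{|D|}, \qquad
P_D = \frac{|\{t \in r \mid t(A) \in (D' \cap D)\}|}{|\{t \in r \mid t(A) \in D\}|}
\]

\[
p_{nD} = \frac{|D' \cap (N \cup E)|}{|N \cup E|}, \qquad
P_{nD} = \frac{|\{t \in r \mid t(A) \in (D' \cap (N \cup E))\}|}{|\{t \in r \mid t(A) \in (N \cup E)\}|}
\]

Here D is the set of domain values, N ∪ E is the set of non-domain values, and
D' is written for the set of values reported as domain values by the discovery
method.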
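The four measures can be computed directly from a column of values. This is a minimal sketch under stated assumptions: the function name `effectiveness` and its argument names are hypothetical, every value outside the domain is treated as a member of N ∪ E, and the toy city strings are made up for illustration.

```python
from collections import Counter


def effectiveness(values, domain, discovered):
    """Compute p_D, P_D, p_nD and P_nD for one attribute.

    values     -- list with one attribute value per tuple, i.e. t(A) for t in r
    domain     -- set of true domain values (D)
    discovered -- set of values reported as domain values (assumed name)
    """
    counts = Counter(values)
    distinct = set(counts)
    domain = set(domain) & distinct   # domain values present in the data set
    non_domain = distinct - domain    # plays the role of N ∪ E
    discovered = set(discovered)

    hits = discovered & domain             # correctly discovered domain values
    false_hits = discovered & non_domain   # non-domain values taken as domain

    def occurrences(vals):
        # Number of tuples whose value belongs to vals.
        return sum(counts[v] for v in vals)

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "p_D": ratio(len(hits), len(domain)),
        "P_D": ratio(occurrences(hits), occurrences(domain)),
        "p_nD": ratio(len(false_hits), len(non_domain)),
        "P_nD": ratio(occurrences(false_hits), occurrences(non_domain)),
    }


# Toy example: two valid names and two misspelled variants.
column = ["Warszawa"] * 5 + ["Krakow"] * 3 + ["Warzsawa", "Krakw"]
print(effectiveness(column, {"Warszawa", "Krakow"}, {"Warszawa", "Krakw"}))
```

In the toy call, one of the two domain values is found (p_D = 0.5) but it accounts for 5 of the 8 domain occurrences (P_D = 0.625), which is why the occurrence-weighted form can differ from the distinct-value form.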
3.1 Results
We performed the experiments on the two aforementioned data sets for ε = 0.87
and α = 2. These values were obtained using the results described in [4] and
yield optimal results in terms of the effectiveness measures. For comparison,
we ran the Oracle Warehouse Builder profiler on the same data sets and chose
the optimal results obtained using this tool. In our experiments, we used two
sets of driving parameters, denoted OWB(1) and OWB(2), respectively. In both
sets, we set the limit for the maximal number of discovered values to 10000 to
ensure the discovery of all the domain values. In the case of OWB(1), the lower
limits were set to 0 to allow the discovery of infrequent values; in the case
of OWB(2), we set the lower limits to 1, thereby limiting the discovered values
to those whose frequency of appearance in the data set was greater than 1%.
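The effect of the two parameter sets can be illustrated with a simple frequency filter. This is not the Oracle Warehouse Builder interface; it is a hypothetical sketch that only mimics the two limits described above, and the names `frequent_values`, `min_percent` and `max_values` are assumptions.

```python
from collections import Counter


def frequent_values(values, min_percent, max_values=10000):
    """Return the distinct values whose relative frequency exceeds min_percent.

    min_percent -- lower limit in percent (0 for OWB(1), 1 for OWB(2))
    max_values  -- cap on the number of reported values (10000 in both sets)
    """
    counts = Counter(values)
    total = len(values)
    kept = [
        value
        for value, count in counts.most_common()  # most frequent first
        if 100.0 * count / total > min_percent    # strictly greater, as in the text
    ]
    return kept[:max_values]


column = ["a"] * 98 + ["b"] + ["c"]
print(frequent_values(column, min_percent=0))  # every value passes
print(frequent_values(column, min_percent=1))  # values at exactly 1% are dropped
```

With a lower limit of 0 every value that occurs is reported, misspellings included; with a lower limit of 1 any value at or below 1% of the data set is dropped, whether it is a rare valid name or an error.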
The results of the experiments are presented in Fig. 3.
The experimental results show that our approach is capable of effectively
discovering domain values for textual attributes heavily affected (10%) by
typographic errors. DOMAIN managed to discover all the domain values in the
synthetic data
 