Fig. 3. The results of the domain discovery executed using Oracle Warehouse Builder
(OWB) and our approach on two data sets: synthetic 'American cities' (left frame)
and real-world 'Polish cities' (right frame).
set and 90% of the domain values in the real-world data set while maintaining
a low ratio of non-domain values identified as domain values: 1.5%
of the non-domain values, covering 0.8% of the relation tuples, in the case of the
synthetic data and 34% of the non-domain values, covering less than 20% of the
relation tuples, in the case of the real-world data.
OWB was less effective than DOMAIN in terms of the effectiveness measures
for both data sets. In the case of the synthetic data set and the OWB(2)
parameter set, OWB identified 28 of the 50 domain values, which covered 87%
of the relation tuples, whereas our method identified all the domain values.
OWB avoided false positives at the cost of missing almost half of the valid
domain values. When applied to the real-world data set, the OWB profiler using
the OWB(2) parameter set identified only 12 out of 1862 domain values (0.64%
of all the domain values), which covered 57% of the relation tuples. As with
the synthetic data set, the ratio of false positives was 0. In contrast, our
algorithm identified 90% of the domain values, covering 99% of the tuples. The
false positives produced by DOMAIN covered less than 20% of the tuples.
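The effectiveness measures used in this comparison can be illustrated with a minimal Python sketch. It assumes the measures are defined as the text suggests: the share of true domain values identified, the share of relation tuples those values cover, the share of non-domain values wrongly reported as domain values, and the share of tuples those false positives cover. The function name `effectiveness` and the toy data are hypothetical and are not taken from the evaluation itself.

```python
from collections import Counter

def effectiveness(tuples, true_domain, discovered):
    """Return (recall, coverage, fp_ratio, fp_coverage) for a discovered domain.

    tuples      -- the attribute values of the relation, one per tuple
    true_domain -- the reference set of valid domain values
    discovered  -- the values a method reported as domain members
    """
    counts = Counter(tuples)            # occurrences of each distinct value
    total = sum(counts.values())        # number of tuples in the relation
    true_domain = set(true_domain)
    discovered = set(discovered)

    hits = discovered & true_domain     # correctly identified domain values
    false_pos = discovered - true_domain
    non_domain = set(counts) - true_domain

    recall = len(hits) / len(true_domain)
    coverage = sum(counts[v] for v in hits) / total
    # Share of non-domain values that were reported as domain values.
    fp_ratio = len(false_pos) / len(non_domain) if non_domain else 0.0
    fp_coverage = sum(counts[v] for v in false_pos) / total
    return recall, coverage, fp_ratio, fp_coverage

# Toy relation (illustrative only): two valid cities and two misspellings.
toy = ["Boston"] * 5 + ["Chicago"] * 3 + ["Bostn"] + ["Chcago"]
print(effectiveness(toy, {"Boston", "Chicago"}, {"Boston", "Bostn"}))
```

In the toy relation, a method that reports 'Boston' and the misspelling 'Bostn' finds half of the domain values and wrongly accepts half of the non-domain values, although that false positive covers only one tuple in ten.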
In the case of the OWB(1) parameter set, OWB marks all the values from
the set as domain members, producing a 100% ratio of false positives.
Our results show that the DOMAIN method can be used for the effective
discovery of domain values in high cardinality data sets heavily affected by data
quality issues. In contrast, the OWB profiler, which takes into consideration only
the coverage of the examined relation by the potential domain values, is more
effective with respect to false positives in situations where the cardinality is low.
In the case of a high cardinality data set, it can either discover only a fraction
of the domain values or denote all the values as domain members.
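The trade-off described above can be sketched with a deliberately simplified, hypothetical coverage rule. This is not OWB's actual algorithm or parameter set; it only assumes a profiler that accepts a value as a domain member when the value covers at least a given share of the tuples. The function name `coverage_threshold_domain` and the parameter `min_share` are illustrative.

```python
from collections import Counter

def coverage_threshold_domain(tuples, min_share):
    """Accept a value as a domain member if it covers >= min_share of tuples.

    A hypothetical, coverage-only rule; it ignores similarity between values.
    """
    counts = Counter(tuples)
    total = sum(counts.values())
    return {v for v, c in counts.items() if c / total >= min_share}

# Toy column (illustrative only): three valid cities and one misspelling.
column = ["Warszawa"] * 6 + ["Krakow"] * 2 + ["Lodz"] + ["Warszwa"]

strict = coverage_threshold_domain(column, 0.5)   # only the most frequent value
loose = coverage_threshold_domain(column, 0.05)   # every value, misspelling included
print(strict, loose)
```

With the strict threshold the rule avoids the misspelling but misses the rare valid values; with the loose threshold it recovers them but also accepts the misspelling, because frequency alone cannot separate a rare valid value from an equally rare error.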