4 Conclusions
In this paper, we have presented the results of experiments in which the
DOMAIN method was applied to synthetic and real-world data. The results
show that DOMAIN can effectively discover domain values with high accuracy
and a low ratio of false positives in data sets suffering from data quality issues.
The method was compared to the commercial solution offered by Oracle
Warehouse Builder and showed greater effectiveness while also maintaining
a higher level of flexibility. The presented method can be used to support
the data profiling stage of the data quality assessment and improvement
process by automating the work needed to obtain up-to-date metadata
describing a given data set.