(c) The automatic generation of a concept hierarchy for numeric data based on the
equal-frequency partitioning rule.
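One possible starting point for part (c) is sketched below in Python. It is an illustration only, not a reference solution: the function name concept_hierarchy, the fanout and depth parameters, and the dictionary layout of each node are all assumptions introduced here. The sketch sorts the values and recursively splits each interval into fanout subintervals that hold nearly the same number of values, so every level of the resulting tree is an equal-frequency partition of its parent.

```python
from typing import Any, Dict, List, Sequence


def _build(sorted_values: Sequence[float], fanout: int, depth: int) -> Dict[str, Any]:
    """Build one node covering sorted_values, then split it recursively."""
    node: Dict[str, Any] = {
        "range": (sorted_values[0], sorted_values[-1]),
        "count": len(sorted_values),
        "children": [],
    }
    n = len(sorted_values)
    # Stop at the requested depth, or when there are too few values to split.
    if depth == 0 or n < fanout:
        return node
    for i in range(fanout):
        # Integer boundaries give each child either floor(n/fanout) or
        # ceil(n/fanout) values: the equal-frequency rule.
        chunk = sorted_values[i * n // fanout:(i + 1) * n // fanout]
        node["children"].append(_build(chunk, fanout, depth - 1))
    return node


def concept_hierarchy(values: List[float], fanout: int = 2, depth: int = 2) -> Dict[str, Any]:
    """Return a nested dict describing an equal-frequency concept hierarchy.

    Hypothetical interface: fanout is the number of children per node and
    depth is the number of levels below the root.
    """
    if not values:
        raise ValueError("values must not be empty")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    return _build(sorted(values), fanout, depth)
```

Because the split is by position in the sorted list, heavily duplicated values can land in adjacent intervals; a fuller solution would have to decide how to handle such ties.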
3.14 Robust data loading poses a challenge in database systems because the input data are
often dirty. In many cases, an input record may miss multiple values; some records
could be contaminated, with some data values out of range or of a different data type
than expected. Work out an automated data cleaning and loading algorithm so that the
erroneous data will be marked and contaminated data will not be mistakenly inserted
into the database during data loading.
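A minimal sketch of one way to approach this exercise follows, in Python with the standard sqlite3 module. Everything specific in it is an assumption made for illustration: the SCHEMA dictionary, the person table, the field names age and salary, their ranges, and the policy itself (missing values are loaded as NULL and marked in a missing column, while records with a wrong type or an out-of-range value are held back and reported).

```python
import sqlite3
from typing import Any, Dict, Iterable, List, Optional, Tuple

# Hypothetical schema: field name -> (converter, lowest valid, highest valid).
SCHEMA = {
    "age": (int, 0, 150),
    "salary": (float, 0.0, 1e7),
}


def validate(record: Dict[str, Any]) -> Tuple[Dict[str, Optional[float]], List[str], List[str]]:
    """Return (clean values, names of missing fields, contamination errors)."""
    clean: Dict[str, Optional[float]] = {}
    missing: List[str] = []
    errors: List[str] = []
    for field, (convert, low, high) in SCHEMA.items():
        raw = record.get(field)
        if raw is None or str(raw).strip() == "":
            clean[field] = None          # stored as NULL and marked as missing
            missing.append(field)
            continue
        try:
            value = convert(raw)
        except (TypeError, ValueError):
            errors.append(f"{field}: wrong type ({raw!r})")
            continue
        if not (low <= value <= high):   # also rejects NaN
            errors.append(f"{field}: {value} outside [{low}, {high}]")
            continue
        clean[field] = value
    return clean, missing, errors


def load_records(records: Iterable[Dict[str, Any]],
                 conn: sqlite3.Connection) -> List[Tuple[int, List[str]]]:
    """Insert clean records; return (record number, errors) for rejected ones."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS person (age INTEGER, salary REAL, missing TEXT)"
    )
    rejected: List[Tuple[int, List[str]]] = []
    for number, record in enumerate(records, start=1):
        clean, missing, errors = validate(record)
        if errors:
            rejected.append((number, errors))   # contaminated: never inserted
            continue
        conn.execute(
            "INSERT INTO person VALUES (?, ?, ?)",
            (clean["age"], clean["salary"], ",".join(missing)),
        )
    conn.commit()
    return rejected
```

Keeping the rejected records with their reasons, rather than silently dropping them, lets a later pass repair and reload them. Whether a record with missing values should be loaded at all is a design choice the exercise leaves open.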
3.8 Bibliographic Notes
Data preprocessing is discussed in a number of textbooks, including English [Eng99],
Pyle [Pyl99], Loshin [Los01], Redman [Red01], and Dasu and Johnson [DJ03]. More
specific references to individual preprocessing techniques are given later.
For discussion regarding data quality, see Redman [Red92]; Wang, Storey, and
Firth [WSF95]; Wand and Wang [WW96]; Ballou and Tayi [BT99]; and Olson [Ols03].
Potter's Wheel (control.cs.berkeley.edu/abc), the interactive data cleaning tool described in
Section 3.2.3, is presented in Raman and Hellerstein [RH01]. An example of the development
of declarative languages for the specification of data transformation operators is
given in Galhardas et al. [GFS+01]. The handling of missing attribute values is discussed
in Friedman [Fri77]; Breiman, Friedman, Olshen, and Stone [BFOS84]; and Quinlan
[Qui89]. Hua and Pei [HP07] presented a heuristic approach to cleaning disguised missing
data, where such data are captured when users falsely select default values on forms
(e.g., “January 1” for birthdate) when they do not want to disclose personal information.
A method for the detection of outlier or “garbage” patterns in a handwritten character
database is given in Guyon, Matic, and Vapnik [GMV96]. Binning and data
normalization are treated in many texts, including Kennedy et al. [KLV+98], Weiss
and Indurkhya [WI98], and Pyle [Pyl99]. Systems that include attribute (or feature)
construction include BACON by Langley, Simon, Bradshaw, and Zytkow [LSBZ87];
Stagger by Schlimmer [Sch86]; FRINGE by Pagallo [Pag89]; and AQ17-DCI by Bloedorn
and Michalski [BM98]. Attribute construction is also described in Liu and Motoda
[LM98a, LM98b]. Dasu et al. built a BELLMAN system and proposed a set of interesting
methods for building a data quality browser by mining database structures [DJMS02].
A good survey of data reduction techniques can be found in Barbará et al. [BDF+97].
For algorithms on data cubes and their precomputation, see Sarawagi and Stonebraker
[SS94]; Agarwal et al. [AAD+96]; Harinarayan, Rajaraman, and Ullman [HRU96]; Ross
and Srivastava [RS97]; and Zhao, Deshpande, and Naughton [ZDN97]. Attribute subset
selection (or feature subset selection) is described in many texts such as Neter, Kutner,
Nachtsheim, and Wasserman [NKNW96]; Dash and Liu [DL97]; and Liu and Motoda
[LM98a, LM98b]. A combination forward selection and backward elimination method