Data Preprocessing - Data Mining: Concepts and Techniques


(c) The automatic generation of a concept hierarchy for numeric data based on the

equal-frequency partitioning rule.
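One possible starting point for item (c) is sketched below, assuming the hierarchy is built by sorting the values and recursively splitting them into k bins holding roughly the same number of values. The function names, the (low, high, children) node layout, and the depth parameter are illustrative assumptions, not part of the exercise. Equal values may land in different bins; a fuller solution would need to decide how to treat such ties.

```python
def equal_frequency_bins(values, k):
    """Split values into at most k bins holding (nearly) equal counts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ordered = sorted(values)
    n = len(ordered)
    bins, start = [], 0
    for i in range(k):
        # The first n % k bins take one extra value each.
        size = n // k + (1 if i < n % k else 0)
        if size > 0:
            bins.append(ordered[start:start + size])
        start += size
    return bins


def concept_hierarchy(values, k, depth):
    """Return a node (low, high, children) covering the given values.

    Each level below the root is produced by equal-frequency partitioning
    of the parent's values; depth limits the number of levels generated.
    """
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    children = []
    if depth > 0 and len(ordered) > 1:
        for part in equal_frequency_bins(ordered, k):
            children.append(concept_hierarchy(part, k, depth - 1))
    return (ordered[0], ordered[-1], children)


# Hypothetical sample data used only for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
tree = concept_hierarchy(prices, 3, 1)
```

Here each of the three bins holds three of the nine sample values, and the one-level hierarchy has a root spanning 4 to 34 with one child interval per bin.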

3.14 Robust data loading poses a challenge in database systems because the input data are

often dirty. In many cases, an input record may miss multiple values; some records

could be contaminated, with some data values out of range or of a different data type

than expected. Work out an automated data cleaning and loading algorithm so that the

erroneous data will be marked and contaminated data will not be mistakenly inserted

into the database during data loading.
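As one way to begin on this exercise, the sketch below checks every record against a per-field rule (required or optional, expected type, allowed range) and passes only clean records to an insert callback, while erroneous records are marked with the reasons they were rejected. The names FieldRule, check_record, and load_records, the dictionary record format, and the sample schema are all assumptions made for illustration; a real loader would also need to handle type coercion, logging, and transactions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class FieldRule:
    """Hypothetical validation rule for one field."""
    type_: type
    required: bool = True
    low: Optional[float] = None
    high: Optional[float] = None


def check_record(record, schema):
    """Return a list of problems found in record (empty if it is clean)."""
    problems = []
    for name, rule in schema.items():
        value = record.get(name)
        if value is None or value == "":
            if rule.required:
                problems.append(f"{name}: missing")
            continue
        # bool is a subclass of int, so reject it unless explicitly expected.
        wrong_type = not isinstance(value, rule.type_) or (
            isinstance(value, bool) and rule.type_ is not bool
        )
        if wrong_type:
            problems.append(
                f"{name}: expected {rule.type_.__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if rule.low is not None and value < rule.low:
            problems.append(f"{name}: below {rule.low}")
        if rule.high is not None and value > rule.high:
            problems.append(f"{name}: above {rule.high}")
    return problems


def load_records(records, schema, insert: Callable[[Any], None]):
    """Insert clean records; return marked (index, record, problems) rejects."""
    rejected = []
    for index, record in enumerate(records):
        problems = check_record(record, schema)
        if problems:
            rejected.append((index, record, problems))
        else:
            insert(record)
    return rejected


# Hypothetical schema and input, for illustration only.
schema = {
    "age": FieldRule(int, low=0, high=120),
    "name": FieldRule(str),
    "email": FieldRule(str, required=False),
}
records = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": -5},
    {"name": "Cy", "age": "forty"},
    {"age": 20},
]
loaded = []
rejected = load_records(records, schema, loaded.append)
```

In this run only the first record reaches the insert callback; the other three are returned with the reasons they were held back, so they can be reviewed or repaired instead of being inserted.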

3.8 Bibliographic Notes

Data preprocessing is discussed in a number of textbooks, including English [Eng99],

Pyle [Pyl99], Loshin [Los01], Redman [Red01], and Dasu and Johnson [DJ03]. More

specific references to individual preprocessing techniques are given later.

For discussion regarding data quality, see Redman [Red92]; Wang, Storey, and

Firth [WSF95]; Wand and Wang [WW96]; Ballou and Tayi [BT99]; and Olson [Ols03].

Potter's Wheel (control.cx.berkely.edu/abc), the interactive data cleaning tool described in

Section 3.2.3, is presented in Raman and Hellerstein [RH01]. An example of the

development of declarative languages for the specification of data transformation operators is

given in Galhardas et al. [GFS+01]. The handling of missing attribute values is discussed

in Friedman [Fri77]; Breiman, Friedman, Olshen, and Stone [BFOS84]; and Quinlan

[Qui89]. Hua and Pei [HP07] presented a heuristic approach to cleaning disguised

missing data, where such data are captured when users falsely select default values on forms

(e.g., “January 1” for birthdate) when they do not want to disclose personal information.

A method for the detection of outlier or “garbage” patterns in a handwritten

character database is given in Guyon, Matic, and Vapnik [GMV96]. Binning and data

normalization are treated in many texts, including Kennedy et al. [KLV+98], Weiss

and Indurkhya [WI98], and Pyle [Pyl99]. Systems that include attribute (or feature)

construction include BACON by Langley, Simon, Bradshaw, and Zytkow [LSBZ87];

Stagger by Schlimmer [Sch86]; FRINGE by Pagallo [Pag89]; and AQ17-DCI by

Bloedorn and Michalski [BM98]. Attribute construction is also described in Liu and Motoda

[LM98a, LM98b]. Dasu et al. built a BELLMAN system and proposed a set of interesting

methods for building a data quality browser by mining database structures [DJMS02].

A good survey of data reduction techniques can be found in Barbará et al. [BDF+97].

For algorithms on data cubes and their precomputation, see Sarawagi and Stonebraker

[SS94]; Agarwal et al. [AAD+96]; Harinarayan, Rajaraman, and Ullman [HRU96]; Ross

and Srivastava [RS97]; and Zhao, Deshpande, and Naughton [ZDN97]. Attribute

subset selection (or feature subset selection) is described in many texts such as Neter, Kutner,

Nachtsheim, and Wasserman [NKNW96]; Dash and Liu [DL97]; and Liu and Motoda

[LM98a, LM98b]. A combination forward selection and backward elimination method
