Discretization - Data Preprocessing in Data Mining

• Predictive Classification Rate: A successful algorithm will often be able to discretize the training set without significantly reducing the predictive capability, on test data, of learners that are designed to handle numerical data.

• Time requirements: A static discretization process is carried out just once on a training set, so its running time does not seem to be a very important evaluation criterion. However, if the discretization phase takes too long it can become impractical for real applications. In dynamic discretization, the operation is repeated as many times as the learner requires, so it should be performed efficiently.
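
As a rough illustration of the time-requirements criterion, the following Python sketch times a simple equal-width binning step. Equal-width binning is used here only as a convenient stand-in for a discretizer. The function names (`equal_width_cut_points`, `discretize`, `timed`) and the sample values are assumptions made for this example and are not taken from KEEL or from the experimental study.

```python
import time
from bisect import bisect_right


def equal_width_cut_points(values, n_bins):
    """Return the n_bins - 1 interior cut points of an equal-width binning."""
    if n_bins < 1 or not values:
        raise ValueError("need at least one bin and one value")
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def discretize(values, cut_points):
    """Map each value to the index of the interval it falls into."""
    # bisect_right sends a value equal to a cut point to the upper interval.
    return [bisect_right(cut_points, v) for v in values]


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical sample attribute, chosen only for illustration.
values = [0.0, 1.0, 2.5, 4.0, 5.5, 7.0, 8.5, 10.0]

# Static use: the cut points are computed once, before learning starts.
cuts, elapsed = timed(equal_width_cut_points, values, 4)
bins = discretize(values, cuts)

print(cuts)   # [2.5, 5.0, 7.5]
print(bins)   # [0, 0, 1, 1, 2, 2, 3, 3]
print(f"cut points computed in {elapsed:.6f} s")
```

In a static setting the measured cost is paid once. A dynamic discretizer would incur a comparable cost every time the learner requests a new discretization, so the total cost grows with the number of such requests.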

9.3.2 Methods and Taxonomy

At the time of writing, more than 80 discretization methods have been proposed in the literature. This section is devoted to enumerating and designating them according to a standard followed in this chapter. We have used 30 discretizers in the experimental study, those that we have identified as the most relevant ones. For more details on their descriptions, the reader can visit the URL associated with the KEEL project (see footnote 1). Additionally, implementations of these algorithms in Java can be found in the KEEL software [3, 4].

Table 9.1 presents an enumeration of the discretizers reviewed in this chapter. The complete name, abbreviation and reference are provided for each one. This chapter does not collect the descriptions of the discretizers. Instead, we recommend that readers consult the original references to understand the complete operation of the discretizers of interest. Discretizers used in the experimental study are shown in bold. The ID3 discretizer used in the study is a static version of the well-known discretizer embedded in C4.5.

The properties studied above can be used to categorize the discretizers proposed in the literature. The seven characteristics studied allow us to present the taxonomy of discretization methods in an established order. All techniques enumerated in Table 9.1 are collected in the taxonomy drawn in Fig. 9.2. It illustrates the categorization following a hierarchy based on this order: static/dynamic, univariate/multivariate, supervised/unsupervised, splitting/merging/hybrid, global/local, direct/incremental and evaluation measure. The rationale behind the choice of this order is to achieve a clear representation of the taxonomy.
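
The hierarchy order described above can be sketched as a nested grouping in which each level of a tree corresponds to one of the seven characteristics. The following Python sketch is only one possible representation: the field names, `DiscretizerProfile`, `build_tree`, the three placeholder methods and their property values (including the evaluation measure labels) are all hypothetical and do not reproduce Table 9.1 or Fig. 9.2.

```python
from dataclasses import dataclass

# Order of the hierarchy as listed in the text (field names are assumptions).
HIERARCHY = (
    "timing",        # static / dynamic
    "variables",     # univariate / multivariate
    "supervision",   # supervised / unsupervised
    "strategy",      # splitting / merging / hybrid
    "scope",         # global / local
    "construction",  # direct / incremental
    "measure",       # evaluation measure (free-form label here)
)


@dataclass(frozen=True)
class DiscretizerProfile:
    name: str
    timing: str
    variables: str
    supervision: str
    strategy: str
    scope: str
    construction: str
    measure: str

    def path(self):
        """Property values in hierarchy order, i.e. the route down the tree."""
        return tuple(getattr(self, field) for field in HIERARCHY)


def build_tree(profiles):
    """Group profiles into nested dicts, one level per characteristic."""
    tree = {}
    for profile in profiles:
        node = tree
        for value in profile.path():
            node = node.setdefault(value, {})
        node.setdefault("_methods", []).append(profile.name)
    return tree


# Placeholder methods with made-up property values, for illustration only.
profiles = [
    DiscretizerProfile("MethodA", "static", "univariate", "supervised",
                       "splitting", "local", "incremental", "measure-x"),
    DiscretizerProfile("MethodB", "static", "univariate", "supervised",
                       "merging", "global", "incremental", "measure-y"),
    DiscretizerProfile("MethodC", "static", "univariate", "supervised",
                       "splitting", "local", "incremental", "measure-x"),
]

tree = build_tree(profiles)
family = tree["static"]["univariate"]["supervised"]["splitting"]["local"]
print(family["incremental"]["measure-x"]["_methods"])  # ['MethodA', 'MethodC']
```

Methods that share a path end up in the same leaf, which makes the size of each family directly visible. Branches with no entries correspond to combinations of properties for which no method has been recorded.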

The proposed taxonomy helps us organize the many discretization methods so that we can classify them into categories and analyze their behavior. The taxonomy is also useful in other respects. For example, it provides a snapshot of existing methods and the relations or similarities among them. It also depicts the size of the families, the work done in each one and what is currently missing. Finally, it provides a general overview of the state-of-the-art methods in discretization for researchers and practitioners who are new to this field or need to discretize data in real applications.

1 http://www.keel.es
