Feature Selection - Data Preprocessing in Data Mining


which it may work. It is normal to compare one method with another, or with a subset of a previously proposed model, to highlight and justify its new benefits, and also to comment on it and find out when it does not work.

To measure the concerns described above, one has to appeal to quantitative measures that together define the performance of a method. Performance can be seen as a list of objectives and, for FS, the list is basically composed of three main goals:

• Inferability: For predictive tasks, assumed to be the main purpose for which FS is developed, this is understood as an improvement in the prediction of unseen examples with respect to the direct usage of the raw training data. In other words, the model or structural representation obtained by the DM algorithms from the subset of features has better predictive capability than that built from the original data.

• Interpretability: Again considering predictive tasks, this goal relates to the model generated by the DM algorithm. Because raw data is hard for humans to comprehend, DM is also used to generate more understandable structural representations that can explain the behavior of the data. It is natural to pursue the simplest possible structural representation because the simpler a representation is, the easier it is to interpret. This goal is at odds with accuracy.

• Data Reduction: Closely related to the previous goal, but in this case referring to the data itself, without involving any DM algorithm. In terms of efficiency and interpretability, it is better and simpler to handle data with fewer dimensions. However, evidence shows that a greater reduction in the number of features does not always bring better understandability.

Our expectation is to improve on the three goals mentioned above at the same time. However, this is a multi-objective optimization problem with conflicting sub-objectives, and it is necessary to find a good trade-off depending on the practice or on the application in question. We can derive three assessment measures from these three goals, to be evaluated independently:

• Accuracy: The most commonly used measure to estimate the predictive power and generalizability of a DM algorithm. High accuracy shows that a learned model works well on unseen data.

• Complexity: It indirectly measures the interpretability of a model. A model is structured as a union of simpler elements, so if the number of such elements is low, the complexity is also low. For instance, a decision tree is composed of branches, leaves and nodes as its basic elements. In a standard decision tree, the number of leaves is equal to the number of branches (taking a branch to be a path from the root to a leaf), although there may be branches of different lengths. The number of nodes in a branch can define the complexity of that branch. Even within each node, the mathematical expression used for splitting the data can have one or more comparisons or operators. Altogether, the count of all of these elements may define the complexity of a representation.

• Number of features selected: A measure for assessing the size of the data. Smaller data sets mean fewer potential hypotheses to be learned, faster learning and simpler results.
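The three measures above can be illustrated on a small hand-built decision tree. The following is only a minimal sketch under stated assumptions: the `Leaf` and `Node` classes, the `assess` function, the feature names, the thresholds and the four held-out examples are all hypothetical and do not come from the text. Complexity is reported here as separate element counts (leaves, test nodes, longest branch), which is one possible reading of "the count of all of these elements".

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class predicted at this leaf


@dataclass
class Node:
    feature: str      # feature compared at this node
    threshold: float  # test is: example[feature] <= threshold
    left: "Tree"      # subtree followed when the test is true
    right: "Tree"     # subtree followed when the test is false


Tree = Union[Leaf, Node]


def predict(tree, example):
    """Follow one branch from the root down to a leaf."""
    while isinstance(tree, Node):
        tree = tree.left if example[tree.feature] <= tree.threshold else tree.right
    return tree.label


def branch_lengths(tree, depth=0):
    """Number of test nodes on each root-to-leaf branch."""
    if isinstance(tree, Leaf):
        return [depth]
    return (branch_lengths(tree.left, depth + 1)
            + branch_lengths(tree.right, depth + 1))


def features_used(tree):
    """Set of distinct features that appear in at least one test."""
    if isinstance(tree, Leaf):
        return set()
    return {tree.feature} | features_used(tree.left) | features_used(tree.right)


def count_test_nodes(tree):
    """Number of internal nodes, each holding a single comparison."""
    if isinstance(tree, Leaf):
        return 0
    return 1 + count_test_nodes(tree.left) + count_test_nodes(tree.right)


def assess(tree, examples, labels):
    """Return accuracy, complexity counts and number of features used."""
    lengths = branch_lengths(tree)
    hits = sum(predict(tree, x) == y for x, y in zip(examples, labels))
    return {
        "accuracy": hits / len(labels),
        "leaves": len(lengths),  # one leaf per root-to-leaf branch
        "test_nodes": count_test_nodes(tree),
        "longest_branch": max(lengths),
        "n_features": len(features_used(tree)),
    }


# Hypothetical tree that tests two of the three available features.
tree = Node("petal_len", 2.5,
            Leaf("A"),
            Node("petal_wid", 1.7, Leaf("B"), Leaf("C")))

# Hypothetical held-out examples (not used to build the tree).
test_examples = [
    {"petal_len": 1.4, "petal_wid": 0.2, "sepal_len": 5.1},
    {"petal_len": 4.5, "petal_wid": 1.5, "sepal_len": 6.4},
    {"petal_len": 5.8, "petal_wid": 2.2, "sepal_len": 6.5},
    {"petal_len": 5.0, "petal_wid": 1.6, "sepal_len": 6.0},
]
test_labels = ["A", "B", "C", "C"]

metrics = assess(tree, test_examples, test_labels)
print(metrics)
```

Computing the same dictionary for models built on different candidate feature subsets gives one triple of measures per subset, which can then be compared when looking for the trade-off discussed above.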
