Introduction - Data Preprocessing in Data Mining

universal instance space is defined as the Cartesian product of all input attribute domains and the target attribute domain.
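To make the definition concrete, the following minimal sketch enumerates a small instance space in Python. The attribute names and their domains (`outlook`, `windy`, and the `play`/`dont_play` target) are hypothetical and not taken from the text; they only illustrate how the input attribute domains combine with the target attribute domain.

```python
from itertools import product

# Hypothetical attribute domains, chosen only for illustration.
input_domains = {
    "outlook": ["sunny", "overcast", "rainy"],
    "windy": [False, True],
}
target_domain = ["play", "dont_play"]

# The universal instance space: every combination of input values
# paired with every possible target value.
instance_space = list(product(*input_domains.values(), target_domain))

print(len(instance_space))  # 3 * 2 * 2 = 12
print(instance_space[0])    # ('sunny', False, 'play')
```

With nominal attributes the space is finite and can be listed like this; as soon as one attribute is real-valued the product becomes infinite and can only be described, not enumerated.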

The two basic and classical problems in the supervised learning category are classification and regression. In classification, the domain of the target attribute is finite and categorical. That is, there is a finite number of classes or categories into which a sample can be placed, and they are known to the learning algorithm. Once trained on a set of training data, a classifier must assign a class to an unseen example. The nature of classification is to discriminate examples from one another, and its main application is reliable prediction: once we have a model that fits the past data, if the future is similar to the past, then we can make correct predictions for new instances. However, when the target attribute can take infinitely many values, as when predicting a real number within a certain interval, we are dealing with a regression problem. Here the supervised learning approach has to fit a model that learns the output target attribute as a function of the input attributes. In general, regression problems present more difficulties than classification problems, and both the required computational resources and the complexity of the model are higher.
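The difference between the two problems lies only in the domain of the target attribute, which a small sketch can show. The example below uses a k-nearest-neighbor rule, chosen here purely for illustration and not prescribed by the text: the same neighbor search yields a class by majority vote or a real number by averaging. The function names and the toy data are assumptions made for this example.

```python
from collections import Counter

def nearest_neighbors(train, x, k):
    """Return the targets of the k training examples whose input is closest to x."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - x))
    return [target for _, target in ranked[:k]]

def classify(train, x, k=3):
    """Classification: the target domain is finite, so take a majority vote."""
    votes = Counter(nearest_neighbors(train, x, k))
    return votes.most_common(1)[0][0]

def regress(train, x, k=3):
    """Regression: the target is a real number, so average the neighbors."""
    targets = nearest_neighbors(train, x, k)
    return sum(targets) / len(targets)

# Toy data, invented for illustration: one numeric input attribute.
labeled = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
numeric = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]

print(classify(labeled, 1.8))  # low
print(regress(numeric, 3.0))   # 6.0
```

The classifier can only ever return one of the class labels seen in training, whereas the regressor may return any value in the range spanned by the neighbors' targets.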

There is another type of supervised learning that involves time data. Time series analysis is concerned with making predictions in time. Typical applications include the analysis of stock prices, market trends and sales forecasting. Because of the time dependence of the data, the preprocessing of time series data differs from the main theme of this topic. Nevertheless, some basic procedures may be of interest and will also be applicable in this field.
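One common way to bring a time series into the supervised setting, not described in the text and shown here only as an assumption-laden sketch, is a sliding window: a fixed number of past values become the input attributes and the next value becomes the target. The function name `sliding_windows` and the sales figures are hypothetical.

```python
def sliding_windows(series, width):
    """Turn a series into (inputs, target) pairs: `width` past values predict the next one."""
    return [
        (tuple(series[i:i + width]), series[i + width])
        for i in range(len(series) - width)
    ]

# Hypothetical monthly sales figures, invented for illustration.
sales = [10, 12, 13, 15, 18]

for inputs, target in sliding_windows(sales, width=2):
    print(inputs, "->", target)
```

Because neighboring windows overlap, the resulting examples are not independent of each other, which is one reason why preprocessing for time series needs separate treatment.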

1.4 Unsupervised Learning

We have seen that in supervised learning the aim is to obtain a mapping from the input to an output whose correct and definite values are provided by a supervisor. In unsupervised learning, there is no such supervisor and only input data is available. Thus, the aim is now to find regularities, irregularities, relationships, similarities and associations in the input. With unsupervised learning, it is possible to learn larger and more complex models than with supervised learning. This is because in supervised learning one is trying to find the connection between two sets of observations. The difficulty of the learning task increases exponentially with the number of steps between the two sets, and that is why supervised learning cannot, in practice, learn models with deep hierarchies.
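One way to look for similarities in unlabeled input is to group nearby values. The sketch below does this with a one-dimensional k-means procedure, picked here as an illustrative technique and not one the text specifies. The function name `k_means_1d`, the data and the starting centers are all assumptions made for the example; note that no target attribute appears anywhere.

```python
def k_means_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move each center to its group's mean."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Unlabeled toy data, invented for illustration; the starting centers are arbitrary.
data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centers, groups = k_means_1d(data, centers=[0.0, 5.0])
print(centers)
print(groups)
```

The procedure finds the two groups from the input alone; what the groups mean is left to whoever interprets the result, since no supervisor has named them.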

Apart from the two well-known problems that belong to the unsupervised learning family, clustering and association rules, there are other related problems that can fit into this category:
