Information Technology Reference
In-Depth Information
For that kind of data mining, we need to know the classes or goals our system
should predict. In most cases we might know these goals a-priori. However, there are
other tasks were the goals are not known a-priori. In that case, we have to find out the
classes based on methods such as clustering before we can go into predictive mining.
Furthermore, the prediction methods can be distinguished into classification and
regression while knowledge discovery can be distinguished into: deviation detection,
clustering, mining association rules, and visualization. To categorize the actual
problem into one of these problem types is the first necessary step when dealing with
Data Mining.
Note that Figure 4 only describes the basic types of data mining methods. We
consider for e.g. text mining, web mining or image mining only as variants of the
basic types of data mining which need a special data preparation.
4.2 Prediction
4.2.1 Classification
Assume there is a set of observations from a particular domain. Among this set of
data there is a subset of data labeled by class 1 and another subset of data labeled by
class 2. Each data entry is described by some descriptive domain variables and the
class label. To give the reader an idea, let us say we have collected information about
customers, such as marital status, sex, and number of children. The class label is the
information whether the customer has purchased a certain product or not. Now we
want to know how the group of buyers and non-buyers is characterized.
The task is now to find a mapping function that allows to separate samples belonging
to class 1 (e.g. the group of internet users) from those belonging to class 2 (e.g. the
group of people that do not use the internet). Furthermore, this function should allow
to predict the class membership of new formerly unseen samples.
Such kind of problems belong to the problem type "classification". There can be more
than two classes but for simplicity we are only considering the two-class problem.
The mapping function can be learnt by decision tree or rule induction, neural
networks, discriminate analysis or case-based reasoning. We will concentrate in this
paper on symbolic learning methods such as decision tree induction. The decision tree
learnt based on the data of our little example described above is shown in Figure 5.
The profile of the buyers is: marital_status = single, number_of_ children=0. The
profile of the non-buyers is: marital_status = married or marital_status = single and
number_of_children > 0. This information can be used to promote potential
customers.
4.2.2 Regression
Whereas classification determines the set membership of the samples, the answer of
regression is numerical. Suppose we have a CCD sensor. We give light of a certain
luminous intensity to this sensor. Then this light is transformed into a gray value by
the sensor according to a transformation function. If we change the luminous intensity
we also change the gray value. That means the variability of the output variable, will
be explained based on the variability of one or more input variables.
Search WWH ::




Custom Search