Discriminant Analysis - Data Mining for the Masses


whether or not to swing a bat, pass a ball, move to a potentially advantageous location of a

playing surface, etc. Their scores were to have been recorded on a scale of 0 to 100,

though Gill has indicated that no one who completed the test should have been able to

score lower than a 3, as three points are awarded simply for successfully entering and

exiting the decision-making part of the battery. Gill knows that all 493 of his former

athletes represented in this data set successfully entered and exited this portion, but there

are a few scores lower than 3, and also a few over 100 in the data set, so we know we have

some data preparation in our future.
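As a rough illustration of the range check described above, the following Python sketch keeps only scores from 3 to 100. It is not the RapidMiner procedure used in this text; the column name `decision_score`, the helper names, and the sample rows are all hypothetical, and dropping out-of-range rows (rather than correcting them) is just one assumed way of handling them.

```python
# Valid range per the text: 3 (the minimum for completing the section) to 100.
MIN_SCORE = 3
MAX_SCORE = 100


def is_valid_score(score):
    """Return True if a decision-making score falls within the valid range."""
    return MIN_SCORE <= score <= MAX_SCORE


def filter_valid(records, key="decision_score"):
    """Keep only records whose score is in range.

    `key` is a hypothetical column name, not one taken from Gill's data set.
    """
    return [r for r in records if is_valid_score(r[key])]


# Made-up sample rows for illustration only (not values from the data set).
sample = [
    {"athlete": "A", "decision_score": 2},    # below 3: inconsistent
    {"athlete": "B", "decision_score": 3},    # lowest valid score
    {"athlete": "C", "decision_score": 57},
    {"athlete": "D", "decision_score": 100},  # highest valid score
    {"athlete": "E", "decision_score": 104},  # above 100: inconsistent
]

clean = filter_valid(sample)
print([r["athlete"] for r in clean])  # prints ['B', 'C', 'D']
```

Both boundary values are kept, since the text treats only scores below 3 or above 100 as inconsistent.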



Prime_Sport: This attribute is the sport each of the 493 athletes went on to specialize in

after they left Gill's academy. This is the attribute Gill is hoping to be able to predict for

his current clients. For the boys in this study, this attribute will be one of four sports:

Football (American, not soccer; sorry soccer fans), Basketball, Baseball, or Hockey.

As we analyze and familiarize ourselves with these data, we realize that all of the attributes with the

exception of Prime_Sport are numeric, and as such, we could exclude Prime_Sport and conduct a

k-means clustering data mining exercise on the data set. Doing this, we might be able to group

individuals into one sport cluster or another based on the means for each of the attributes in the

data set. However, having the Prime_Sport attribute gives us the ability to use a different type of

data mining model: Discriminant Analysis. Discriminant analysis is a lot like k-means clustering,

in that it groups observations together into like-types of values, but it also gives us something

more, and that is the ability to predict. Discriminant analysis then helps us cross that intersection

seen in the Venn diagram in Chapter 1 (Figure 1-2). It is still a data mining methodology for

classifying observations, but it classifies them in a predictive way. When we have a data set that

contains an attribute that we know is useful in predicting the same value for other observations

that do not yet have that attribute, then we can use training data and scoring data to mine

predictively. Training data are simply data sets that have that known prediction attribute. For the

observations in the training data set, the outcome of the prediction attribute is already known. The

prediction attribute is also sometimes referred to as the dependent attribute (or variable) or the

target attribute. It is the thing you are trying to predict. RapidMiner will ask us to set this

attribute to be the label when we build our model. Scoring data are the observations which have

all of the same attributes as the training data set, with the exception of the prediction attribute. We

can use the training data set to allow RapidMiner to evaluate the values of all our attributes in the

context of the resulting prediction variable (in this case, Prime_Sport), and then compare those

values to the scoring data set and predict the Prime_Sport for each observation in the scoring data
