whether or not to swing a bat, pass a ball, move to a potentially advantageous location of a
playing surface, etc. Their scores were to have been recorded on a scale of 0 to 100,
though Gill has indicated that no one who completed the test should have been able to
score lower than a 3, as three points are awarded simply for successfully entering and
exiting the decision-making part of the battery. Gill knows that all 493 of his former
athletes represented in this data set successfully entered and exited this portion, but there
are a few scores lower than 3, and also a few over 100 in the data set, so we know we have
some data preparation in our future.
Prime_Sport: This attribute is the sport each of the 493 athletes went on to specialize in
after they left Gill's academy. This is the attribute Gill is hoping to be able to predict for
his current clients. For the boys in this study, this attribute will be one of four sports:
Football (American, not soccer; sorry soccer fans), Basketball, Baseball, or Hockey.
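The out-of-range decision-making scores mentioned above (below 3 or above 100) are the kind of inconsistency we would filter out before modeling. As a rough illustration outside of RapidMiner, the following Python sketch separates valid observations from invalid ones. The column name Decision_Making, the id field, and the sample values are assumptions made up for this example; they are not taken from Gill's data set.

```python
# Valid range stated in the text: 3 points minimum, 100 points maximum.
VALID_MIN, VALID_MAX = 3, 100


def split_by_valid_range(rows, attribute, low=VALID_MIN, high=VALID_MAX):
    """Return (kept, rejected): rows whose attribute is inside or outside [low, high]."""
    kept, rejected = [], []
    for row in rows:
        if low <= row[attribute] <= high:
            kept.append(row)
        else:
            rejected.append(row)
    return kept, rejected


# Hypothetical observations; the attribute name and values are illustrative only.
athletes = [
    {"id": 1, "Decision_Making": 57},
    {"id": 2, "Decision_Making": 2},    # below the 3-point floor
    {"id": 3, "Decision_Making": 103},  # above the 100-point ceiling
    {"id": 4, "Decision_Making": 3},    # boundary value, still valid
]

kept, rejected = split_by_valid_range(athletes, "Decision_Making")
print("kept ids:", [r["id"] for r in kept])
print("rejected ids:", [r["id"] for r in rejected])
```

Whether to drop such observations or to correct them is a judgment call; this sketch only sets them aside so they can be reviewed.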
As we analyze and familiarize ourselves with these data, we realize that all of the attributes with the
exception of Prime_Sport are numeric, and as such, we could exclude Prime_Sport and conduct a
k-means clustering data mining exercise on the data set. Doing this, we might be able to group
individuals into one sport cluster or another based on the means for each of the attributes in the
data set. However, having the Prime_Sport attribute gives us the ability to use a different type of
data mining model: Discriminant Analysis. Discriminant analysis is a lot like k-means clustering,
in that it groups observations together into like-types of values, but it also gives us something
more, and that is the ability to predict. Discriminant analysis then helps us cross that intersection
seen in the Venn diagram in Chapter 1 (Figure 1-2). It is still a data mining methodology for
classifying observations, but it classifies them in a predictive way. When we have a data set that
contains an attribute that we know is useful in predicting the same value for other observations
that do not yet have that attribute, then we can use training data and scoring data to mine
predictively. Training data are simply data sets that have that known prediction attribute. For the
observations in the training data set, the outcome of the prediction attribute is already known. The
prediction attribute is also sometimes referred to as the dependent attribute (or variable) or the
target attribute. It is the thing you are trying to predict. RapidMiner will ask us to set this
attribute to be the label when we build our model. Scoring data are the observations which have
all of the same attributes as the training data set, with the exception of the prediction attribute. We
can use the training data set to allow RapidMiner to evaluate the values of all our attributes in the
context of the resulting prediction variable (in this case, Prime_Sport), and then compare those
values to the scoring data set and predict the Prime_Sport for each observation in the scoring data
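
To make the training/scoring distinction concrete, here is a minimal Python sketch. It is not RapidMiner's discriminant analysis operator. It uses a simplified stand-in, a nearest-class-mean classifier, which assigns each scoring observation to the sport whose average attribute values in the training data are closest. The attribute names Strength and Agility and all of the values are hypothetical; only Prime_Sport and the sport names come from the text.

```python
from collections import defaultdict
from math import dist


def fit_class_means(training_rows, label):
    """Average every non-label attribute within each label value of the training data."""
    grouped = defaultdict(list)
    for row in training_rows:
        grouped[row[label]].append(row)
    means = {}
    for sport, rows in grouped.items():
        attributes = [name for name in rows[0] if name != label]
        means[sport] = {a: sum(r[a] for r in rows) / len(rows) for a in attributes}
    return means


def predict(means, row):
    """Return the label whose class mean is nearest (Euclidean distance) to the row."""
    def distance_to(sport):
        centre = means[sport]
        return dist([row[a] for a in centre], [centre[a] for a in centre])
    return min(means, key=distance_to)


# Training data: the prediction attribute (the label) is already known.
training = [
    {"Strength": 8, "Agility": 3, "Prime_Sport": "Football"},
    {"Strength": 9, "Agility": 4, "Prime_Sport": "Football"},
    {"Strength": 4, "Agility": 9, "Prime_Sport": "Basketball"},
    {"Strength": 3, "Agility": 8, "Prime_Sport": "Basketball"},
]

# Scoring data: same attributes, but no Prime_Sport yet.
scoring = [
    {"Strength": 7, "Agility": 4},
    {"Strength": 4, "Agility": 7},
]

means = fit_class_means(training, label="Prime_Sport")
predictions = [predict(means, row) for row in scoring]
print(predictions)
```

The label is used only when fitting on the training rows; the scoring rows never carry it, which mirrors the role the two data sets play in the text.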