DEPLOYMENT
The concept of deployment in data mining means doing something with what you've learned from
your model: taking some action based upon what your model tells you. In this chapter's example,
we conducted some basic, exploratory analysis for our fictional figure, Sarah. There are several
possible outcomes from this investigation.
We learned through our investigation that the two most strongly correlated attributes in our data
set are Heating_Oil and Avg_Age, with a coefficient of 0.848. Thus, we know that in this data set,
as the average age of the occupants in a home increases, so too does the heating oil usage in that
home. What we do not know is why that occurs. Data analysts often make the mistake of
interpreting correlation as causation. The assumption that correlation proves causation is
dangerous and often false.
Consider for a moment the correlation coefficient between Avg_Age and Temperature: -0.673.
Referring back to Figure 4-7, we see that this is considered to be a relatively strong negative
correlation. As the age of a home's residents increases, the average temperature outside decreases;
and as the temperature rises, the age of the folks inside goes down. But could the average age of a
home's occupants have any effect on that home's average yearly outdoor temperature? Certainly
not. If it did, we could control the temperature by simply moving people of different ages in and
out of homes. This, of course, is silly. While there is a statistical correlation between these two
attributes in our data set, there is no logical reason that movement in one causes movement in the
other. The relationship is probably coincidental, but if not, there must be some other explanation
that our model cannot offer. Such limitations must be recognized and accepted in all data mining
deployment decisions.
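One kind of "other explanation" is a factor that the data set does not record but that influences both attributes. The sketch below illustrates the idea with simulated numbers only; the names hidden, attr_a and attr_b are hypothetical and do not correspond to attributes in Sarah's data set. Neither observed attribute affects the other, yet the two end up strongly negatively correlated because both depend on the hidden factor.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical hidden factor that the data set does not record.
hidden = [rng.gauss(0, 1) for _ in range(n)]

# Two observed attributes. Each depends on the hidden factor plus its own
# independent noise; neither one depends on the other.
attr_a = [h + rng.gauss(0, 0.5) for h in hidden]
attr_b = [-h + rng.gauss(0, 0.5) for h in hidden]

# Correlation as a data miner would see it, without access to `hidden`.
r_observed = pearson(attr_a, attr_b)

# Subtract the hidden factor's contribution; only the independent noise is left.
resid_a = [a - h for a, h in zip(attr_a, hidden)]
resid_b = [b + h for b, h in zip(attr_b, hidden)]
r_residual = pearson(resid_a, resid_b)

print(f"observed correlation:            {r_observed:.3f}")
print(f"after removing the hidden factor: {r_residual:.3f}")
```

The observed coefficient is strongly negative, while the coefficient computed after the hidden factor is removed sits close to zero. A correlation matrix alone cannot tell us which situation we are in, which is why the coefficient should not be read as evidence of cause.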
Another common misinterpretation of correlations is that the coefficients are percentages, as if to
say that a correlation coefficient of 0.776 between two attributes indicates 77.6% shared
variability between those two attributes. This is not correct. While the coefficients do tell a
story about the shared variability between attributes, the underlying mathematical formula used to
calculate correlation coefficients measures only the strength of the interaction between
attributes, as indicated by proximity to 1 or -1. No percentage is calculated or intended.
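To see why no percentage is involved, it can help to look at how such a coefficient is typically computed. The sketch below assumes the coefficients in this chapter are standard Pearson correlation coefficients; the function name pearson and the toy values are ours, made up for illustration, and are not taken from the chapter's data set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of paired deviations from the means: how the two
    # attributes move together.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations: how much each attribute varies on its own.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    # Dividing by the combined spread bounds the result between -1 and 1.
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values, NOT taken from the chapter's data set.
avg_age = [30, 40, 50, 60, 70]
heating_oil = [110, 150, 160, 210, 220]

r = pearson(avg_age, heating_oil)
print(f"correlation coefficient: {r:.3f}")
```

The result is a unitless ratio that can never fall outside the range -1 to 1, and nothing in the calculation converts it to a share of anything. When a proportion of shared variability is wanted, the statistic usually quoted is the square of the coefficient, which is a separate calculation: for a coefficient of 0.776 that would be about 0.602, not 0.776.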
 