Correlation - Data Mining for the Masses

DEPLOYMENT

In data mining, deployment means doing something with what you have learned from your model:
taking some action based on what the model tells you. In this chapter's example, we conducted
some basic exploratory analysis for our fictional figure, Sarah. Several outcomes are possible
from this investigation.

We learned through our investigation that the two most strongly correlated attributes in our data
set are Heating_Oil and Avg_Age, with a coefficient of 0.848. Thus, we know that in this data set,
as the average age of the occupants in a home increases, so too does the heating oil usage in that
home. What we do not know is why that occurs. Data analysts often make the mistake of
interpreting correlation as causation. The assumption that correlation proves causation is
dangerous and often false.
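
To make the coefficient concrete, here is a minimal Python sketch of how a correlation coefficient can be computed, assuming the Pearson coefficient (the usual choice for a correlation matrix). The function name pearson_r and the sample values are hypothetical illustrations. They are not the chapter's data set and are not expected to reproduce 0.848.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations (numerator) and sums of squared deviations (denominator).
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    if sq_dev_x == 0 or sq_dev_y == 0:
        raise ValueError("correlation is undefined when one attribute is constant")
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)


# Hypothetical values for illustration only, NOT the chapter's data set.
avg_age = [30, 35, 42, 50, 58, 63, 70]
heating_oil = [110, 140, 135, 180, 200, 190, 240]

r = pearson_r(avg_age, heating_oil)
print(f"r = {r:.3f}")
```

A result close to 1 says only that the two columns rise together in this sample. Nothing in the calculation looks at why they do, which is the reason the coefficient alone cannot establish causation.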

Consider for a moment the correlation coefficient between Avg_Age and Temperature: -0.673.
Referring back to Figure 4-7, we see that this is considered to be a relatively strong negative
correlation. As the age of a home's residents increases, the average temperature outside decreases;
and as the temperature rises, the age of the folks inside goes down. But could the average age of a
home's occupants have any effect on that home's average yearly outdoor temperature? Certainly
not. If it did, we could control the temperature by simply moving people of different ages in and
out of homes. This, of course, is silly. Although there is a statistical correlation between these two
attributes in our data set, there is no logical reason that movement in one causes movement in the
other. The relationship is probably coincidental, but if not, there must be some other explanation
that our model cannot offer. Such limitations must be recognized and accepted in all data mining
deployment decisions.
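
One kind of "other explanation" is a hidden factor that influences both attributes. The following Python sketch simulates that situation under stated assumptions: the hidden factor, the formulas, and all numbers are invented for illustration and say nothing about what actually drives the chapter's data set. In the simulation, neither attribute appears in the other's formula, yet the two come out strongly negatively correlated.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical hidden factor between 0 and 1 (an assumption, not in the source).
hidden = [rng.random() for _ in range(500)]

# Each attribute depends only on the hidden factor plus its own random noise.
avg_age = [40 + 30 * h + rng.gauss(0, 5) for h in hidden]
temperature = [70 - 40 * h + rng.gauss(0, 5) for h in hidden]

r = pearson_r(avg_age, temperature)
print(f"r between simulated Avg_Age and Temperature = {r:.3f}")
```

Changing the simulated ages by hand would leave the simulated temperatures untouched, because the only link between them runs through the hidden factor.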

Another common misinterpretation is that correlation coefficients are percentages, as if a
coefficient of 0.776 between two attributes indicated 77.6% shared variability between them.
This is not correct. While the coefficients do tell a story about the shared variability between
attributes, the underlying mathematical formula used to calculate correlation coefficients solely
measures the strength, as indicated by proximity to 1 or -1, of the interaction between attributes.
No percentage is calculated or intended.
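
A short Python sketch can show why reading a coefficient as a percentage goes wrong. In standard statistics, the quantity usually interpreted as the proportion of variance two attributes share under a linear fit is the square of the coefficient, not the coefficient itself. The helper name shared_variance is hypothetical; the three coefficients are the ones quoted in this section.

```python
def shared_variance(r):
    """Return r squared for a correlation coefficient r in [-1, 1]."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    return r * r


# Coefficients quoted in the text above.
coefficients = {
    "example coefficient": 0.776,
    "Heating_Oil vs Avg_Age": 0.848,
    "Avg_Age vs Temperature": -0.673,
}

for label, r in coefficients.items():
    print(f"{label}: r = {r:+.3f}, r squared = {shared_variance(r):.3f}")
```

For 0.776 the squared value is about 0.602, noticeably lower than the 77.6% a percentage reading would suggest. The sign also disappears when squaring, so the coefficient and its square answer different questions.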
