Association Rules - Data Mining for the Masses

Figure 5-1. Adding the data for the Chapter 5 example model.

3) In results perspective, look first at the Meta Data view (Figure 5-2). Note that we do not have any missing values among any of the 12 attributes across 3,483 observations. In examining the statistics, we do not see any inconsistent data. For numeric data types, RapidMiner has given us the average (avg), or mean, for each attribute, as well as the standard deviation for each attribute. Standard deviations are measurements of how dispersed or varied the values in an attribute are, and so can be used to watch for inconsistent data. A good rule of thumb is that any value that is more than two standard deviations below the mean (or arithmetic average), or more than two standard deviations above the mean, is a statistical outlier. For example, in the Age attribute in Figure 5-2, the average age is 36.731, while the standard deviation is 10.647. Two standard deviations above the mean would be 58.025 (36.731+(2*10.647)), and two standard deviations below the mean would be 15.437 (36.731-(2*10.647)). If we look at the Range column in Figure 5-2, we can see that the Age attribute has a range of 17 to 57, so all of our observations fall within two standard deviations of the mean. We find no inconsistent data in this attribute. This won't always be the case, so a data miner should always be watchful for such indications of inconsistent data. It's important to realize also that while two standard deviations is a guideline, it's not a hard-and-fast rule. Data miners should be thoughtful about why some observations may be legitimate and yet far from the mean, or why some values that fall within two standard deviations of the mean should still be scrutinized. One other item should be noted as we
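The two-standard-deviation check worked through for the Age attribute can also be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the RapidMiner workflow: the function names `two_sigma_bounds` and `values_outside_bounds` are hypothetical, and the mean and standard deviation are the Age figures quoted from Figure 5-2.

```python
def two_sigma_bounds(avg, std_dev, k=2.0):
    """Return the (lower, upper) bounds k standard deviations from the mean."""
    return avg - k * std_dev, avg + k * std_dev


def values_outside_bounds(values, avg, std_dev, k=2.0):
    """Return the values lying more than k standard deviations from the mean."""
    lower, upper = two_sigma_bounds(avg, std_dev, k)
    return [v for v in values if v < lower or v > upper]


# Age attribute statistics as reported in the Meta Data view (Figure 5-2).
age_avg = 36.731
age_std_dev = 10.647

lower, upper = two_sigma_bounds(age_avg, age_std_dev)
print(f"Two-standard-deviation band: {lower:.3f} to {upper:.3f}")

# The Range column reports ages from 17 to 57, so neither end is flagged.
flagged = values_outside_bounds([17, 57], age_avg, age_std_dev)
print("Range endpoints outside the band:", flagged)
```

Because the rule is a guideline rather than a hard-and-fast test, the multiplier `k` is left as a parameter; any values the check flags are candidates for a closer look, not automatic errors.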
