Large Data Volumes
The best way to reduce the effect of truly random fluctuations in
correlation is to incorporate large volumes of data. In statistical
analysis this is the concept of statistical significance. In general, a
larger sample set generates a result with higher statistical significance
and a smaller sample set generates a result with lower statistical
significance. Higher statistical significance occurs as an increasing
portion of the randomness in a sample set is explained. This does not
mean that a large sample set will, by volume alone, explain away all
randomness. If that were the case, Market Basket Analysis would be
reduced to an exercise in data gathering rather than data analysis.
Instead, larger data volumes (i.e., large sample sets) simply widen the
scope and perspective of the world as viewed through the data, like
seeing the forest rather than the trees.
To illustrate the effect of sample size on the conclusions drawn during
Market Basket Analysis, consider an investigation into the correlation
between two objects: milk and cookies. Recognizing the wonderful
experience of dunking a cookie into a glass of milk and then eating that
cookie, you would expect milk and cookies to have a high affinity for
each other.
In a sample set of two Itemsets you find the following:
• Milk and no cookies
• Cookies and no milk
In this abbreviated sample set, milk and cookies seem to be substitutes,
as they never occur simultaneously in the same Itemset. However,
recognizing the brevity of the sample set, you include another Itemset.
Now the sample set includes the following three Itemsets:
• Milk and no cookies
• Cookies and no milk
• Milk and cookies
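The two sample sets listed above can be tallied with a short sketch. This
is only an illustration, not a method taken from the text: the function
name co_occurrence_share and the representation of each Itemset as a
Python set are assumptions made for the example.

```python
from fractions import Fraction


def co_occurrence_share(itemsets, a, b):
    """Return the fraction of itemsets that contain both a and b.

    Hypothetical helper for illustration; each itemset is a Python set.
    """
    both = sum(1 for items in itemsets if a in items and b in items)
    return Fraction(both, len(itemsets))


# First sample set: two Itemsets, neither holding both objects.
two_itemsets = [{"milk"}, {"cookies"}]

# Second sample set: the same two Itemsets plus one holding both objects.
three_itemsets = two_itemsets + [{"milk", "cookies"}]

share_two = co_occurrence_share(two_itemsets, "milk", "cookies")
share_three = co_occurrence_share(three_itemsets, "milk", "cookies")

print(share_two)    # 0
print(share_three)  # 1/3
```

Adding a single Itemset moves the co-occurrence share from zero to
one-third, which shows how sensitive the measure is when the sample set
is this small.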
A conclusion drawn from this second sample set would indicate that
two-thirds of the time, milk and cookies are substitutes, and one-third
of the time milk and cookies are complementary. Not until you increase
 