Database Reference
In-Depth Information
all interesting patterns. This is mostly due to that these measures are typically not
monotonic—and hence do not easily allow for efficient search—as well as that it
is often difficult to express a meaningful threshold (i.e., significance level). Instead,
these measures are used to rank a given collection of patterns, e.g., mined all frequent
patterns up to a certain support threshold, or, when practical bounds are available, to
mine a top- k of the most significant patterns. Many of the authors of work we survey
in this section argue that in practice analysts do not want, nor have time, to consider
all patterns, and hence a small list of the most interesting patterns is preferable.
4.1
Independence Model
We start with the simplest background model, which is the model where we assume
that the individual items are all independent. Under this assumption we expect the
frequency of a given itemset X
=
x 1 ···
x n to be equal to
n
ind ( X )
=
fr ( x i )
.
i
=
1
The background knowledge we use are simply the frequencies of the individual
items, which can be straightforwardly computed from the data. Moreover, it seems
reasonable to expect to expect the data analyst (e.g., store manager) to know these
margins (e.g., how often each product is sold) and hence be able to make such infer-
ences intuitively. As such, the independence model is expected to correctly identify
'boring' patterns, patterns for which the frequencies follow under the independence
model.
Testing Observations against Expectations Now that we have a model, we will
use it as an exemplar to discuss a range of widely used methods for comparing the
observed measurement with the expectation. After covering these general methods,
we will discuss more detailed models, and more specialized measures.
With the above, we can compute both the observed frequency fr ( X ) and the
expectation ind ( X ) of the independence model. The next step is compare these two
quantities. A straightforward way to do this is to consider their ratio, a measure
known as lift [ 33 ], and formally defined as
lift ( X )
=
fr ( X ) /ind ( X ) .
Here we consider itemsets that have a high lift to be interesting, that is, itemsets
whose observed support is substantially higher than the independence assumption.
Hence, a larger ratio implies higher interestingness.
In our example, we have fr ( ab )
=
0 . 66, while under the independence model we
5 × 4
6 × 6
have ind ( ab )
=
=
0 . 55. As such, we find lift ( ab )
=
1 . 2. For abc , on the other
hand, we have lift ( abc )
=
0 . 33 / 0 . 18
=
1 . 83. While both patterns have a positive lift
Search WWH ::




Custom Search