4.1.2 Beyond Single Measurements
Instead of comparing just a single statistic, like support or the violation rate, we can consider much richer information. One example is to compare the complete contingency table of an itemset $X$ with an expectation [10].
Assume we are given a distribution $p$ over items in $X = x_1 \cdots x_M$. That is, a distribution over $2^{|X|}$ entries. For convenience, let us write $p(X = t) = p(x_1 = t_1, \ldots, x_M = t_M)$, where $t$ is a binary vector of length $M$. We now consider two different distributions:
the first is the empirical distribution computed from the dataset,

$$p_{emp}(X = t) = \frac{|\{ u \in D \mid u_X = t \}|}{|D|},$$

where $u_X$ denotes the transaction $u$ restricted to the items of $X$,
and the second, $p_{ind}$, is the independence model,

$$p_{ind}(X = t) = \prod_{i=1}^{M} p_{ind}(x_i = t_i),$$

where the margins (item frequencies) are computed from the input data.
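To make these definitions concrete, the following Python sketch computes both distributions for a given itemset. The representation of $D$ as a list of sets of items, and the function names, are our own illustration, not part of the chapter:

    from itertools import product
    from math import prod

    def empirical_dist(D, X):
        """p_emp(X = t): fraction of transactions u whose projection u_X equals t."""
        n = len(D)
        counts = {t: 0 for t in product((0, 1), repeat=len(X))}
        for u in D:
            t = tuple(int(x in u) for x in X)  # u_X: u restricted to the items of X
            counts[t] += 1
        return {t: c / n for t, c in counts.items()}

    def independence_dist(D, X):
        """p_ind(X = t) = prod_i p_ind(x_i = t_i), with margins taken from D."""
        n = len(D)
        margins = [sum(1 for u in D if x in u) / n for x in X]  # fr(x_i)
        return {t: prod(m if ti else 1.0 - m for m, ti in zip(margins, t))
                for t in product((0, 1), repeat=len(X))}

For example, with D = [{'a', 'b'}, {'a'}, {'b'}, set()] and X = ('a', 'b'), both functions return a distribution over the four cells of the contingency table of X.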
The standard way of comparing these two distributions is by doing a so-called $G$-test, which essentially is a log-likelihood ratio test,

$$2 \sum_{t \in D} \left( \log p_{emp}(X = t_X) - \log p_{ind}(X = t_X) \right).$$
Under the assumption that the items of $X$ are distributed independently (which we here do), this quantity approaches the $\chi^2$ distribution with $2^{|X|} - 1 - |X|$ degrees of freedom. Interestingly, this quantity can also be seen as a (scaled) Kullback-Leibler divergence, $2\,|D|\,\mathrm{KL}(p_{emp} \parallel p_{ind})$.
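Continuing the sketch above, the $G$ statistic and its asymptotic p-value can be computed as follows. We use scipy's $\chi^2$ survival function for the tail probability; skipping cells with $p_{emp} = 0$ follows the standard convention $0 \log 0 = 0$ and is our choice, not one made in the text:

    from math import log
    from scipy.stats import chi2

    def g_test(p_emp, p_ind, n):
        """G statistic, 2|D| KL(p_emp || p_ind), and its asymptotic p-value."""
        g = 2.0 * n * sum(pe * log(pe / p_ind[t])
                          for t, pe in p_emp.items() if pe > 0)
        num_items = len(next(iter(p_emp)))   # |X|
        df = 2 ** num_items - 1 - num_items  # degrees of freedom (assumes |X| >= 2)
        return g, chi2.sf(g, df)             # p-value: upper tail of the chi^2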
Alternatively, we can also compare the two distributions with Pearson's $\chi^2$ test,

$$|D| \sum_{t \in \{0,1\}^{|X|}} \frac{\left( p_{emp}(X = t) - p_{ind}(X = t) \right)^2}{p_{emp}(X = t)},$$

which has the same asymptotic behavior as the $G$-test.
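The Pearson statistic admits an equally small sketch. Following the formula above we divide by the empirical probability, skipping cells with zero empirical mass (again a practical choice on our part); its p-value follows from the same $\chi^2$ distribution as before:

    def pearson_chi2(p_emp, p_ind, n):
        """|D| * sum_t (p_emp(t) - p_ind(t))^2 / p_emp(t) over the cells of X."""
        return n * sum((pe - p_ind[t]) ** 2 / pe
                       for t, pe in p_emp.items() if pe > 0)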
Each of these tests can be used to determine the p-value, or likelihood, of a pattern under an assumed model. In practice, these measurements are used to rank the patterns from most surprising (under the model) to least surprising, typically showing the user only the top-$k$ most surprising patterns.
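Putting the pieces together, a top-$k$ ranking by surprisingness could look like the following sketch, where smaller p-values mean more surprising patterns; how the candidate itemsets are generated is left open:

    def top_k_surprising(D, itemsets, k=10):
        """Return the k candidate itemsets with the smallest G-test p-values."""
        scored = []
        for X in itemsets:
            p_emp = empirical_dist(D, X)
            p_ind = independence_dist(D, X)
            _, pval = g_test(p_emp, p_ind, len(D))
            scored.append((pval, X))
        scored.sort(key=lambda s: s[0])  # most surprising (smallest p-value) first
        return scored[:k]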
Next, we look into more elaborate models, which allow us to make more realistic assumptions about the data than complete independence.