set which is likely to contain almost all the frequent itemsets, but might miss some
of them (with very small probability controlled by δ ).
Finally, while LOSSY COUNTING works for any data stream, the probabilistic
performance guarantee of FPDM only applies if the contents of the transactions are
independent of one another. In real applications, this is often not true. The authors of
FPDM [77] suggest random sampling from a data reservoir to alleviate this problem
[25], at the cost of doubling the memory requirement.
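The reservoir idea can be illustrated with standard reservoir sampling (Vitter's Algorithm R): maintain a fixed-size uniform random sample of the transactions seen so far, and mine from that sample rather than directly from the possibly correlated stream. This is a minimal sketch under that interpretation; the reservoir size and the use of integers as stand-ins for transactions are illustrative, not details of FPDM itself.

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k items from a stream
    of unknown length (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randrange(i + 1)  # uniform index in [0, i]
            if j < k:
                reservoir[j] = item      # keep item with probability k/(i+1)
    return reservoir

# Example: sample 5 "transactions" from a stream of 1000
sample = reservoir_sample(range(1000), 5)
```

Every item in the stream ends up in the sample with equal probability k/n, which is what breaks the dependence between consecutive transactions at the cost of storing the reservoir alongside the mining structures.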
3.2 Recently Frequent Itemsets
We now look at various ways to focus the selection of frequent itemsets on more
recent data: the damped window model, the sliding window model, and the tilted-time
model.
Damped Window Model Chang and Lee [14] study the problem of finding recently
frequent itemsets over data streams using the damped window model. Specifically,
in their model, the weights of previously recorded transactions in the data stream are
periodically reduced by a decay factor d, where 0 < d ≤ 1. In their algorithm, this
decayed weight is used when counting the number of transactions and itemsets received.
For example, a newly arrived transaction has an initial weight of 1. Suppose
n_1 transactions arrive in the first time window and n_2 arrive in the second time window.
At the end of the second window, the weighted count of transactions is
|D|_{t=2} = n_2 + n_1 d. At the end of the third time window, we have
|D|_{t=3} = n_3 + n_2 d + n_1 d^2.
The weights of itemsets are discounted similarly. However, rather than updating the
value of all stored itemsets every time period, a more efficient approach can be to
update a value only when an itemset's weight needs to be read. To implement this
we record both a count and the timestamp of the last update: (f(e), t(e)). When a
new instance of itemset e arrives at time t_k, we update the record as follows:

    f(e) ← f(e) · d^{t_k − t(e)} + 1        (9.5)
    t(e) ← t_k
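The lazy update of Eq. (9.5) can be sketched as follows: each itemset's record stores (f(e), t(e)), and the accumulated decay d^{t_k − t(e)} is applied only when the record is next observed or read. The class name and the chosen decay factor are illustrative, not taken from the original paper.

```python
class DecayedCount:
    """Decayed itemset count with lazy updates: the stored count is
    only brought up to date when the itemset is seen or read."""

    def __init__(self, d):
        self.d = d    # decay factor, 0 < d <= 1
        self.f = 0.0  # decayed count f(e)
        self.t = 0    # time of last update t(e)

    def observe(self, t_k):
        """A new instance of the itemset arrives at time t_k (Eq. 9.5)."""
        self.f = self.f * self.d ** (t_k - self.t) + 1
        self.t = t_k

    def read(self, t_k):
        """Current decayed count at time t_k, without modifying the record."""
        return self.f * self.d ** (t_k - self.t)

c = DecayedCount(d=0.5)
c.observe(1)  # f = 1
c.observe(2)  # f = 1 * 0.5 + 1 = 1.5
c.observe(4)  # f = 1.5 * 0.5**2 + 1 = 1.375
```

Because the decay compounds multiplicatively across windows, applying it once over the elapsed gap t_k − t(e) gives the same result as decaying every period, which is what makes the lazy scheme safe.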
Combining the damped window model with a method for estimating itemset counts,
Chang and Lee produce the estDec algorithm. Their estimation method is similar to
Carma [35]. Let e_m be an itemset containing m items. The Apriori principle tells us that
the count for e_m cannot be greater than the count for any of its (m − 1)-subitemsets.
Let E_{m−1}(e) be the set of all these (m − 1)-sized subsets. This sets an upper bound:

    f(e)_max = min { f(a) | a ∈ E_{m−1}(e) }        (9.6)
Then, noting relationships between unions and intersections of itemsets, they provide
a lower bound:

    f(e)_min = max { f_min(a ∪ b) | a, b ∈ E_{m−1}(e) and a ≠ b }        (9.7)
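A sketch of how the two bounds could be computed from stored subset counts. Here `counts` maps each stored itemset (a frozenset) to its count, and the lower bound for a pair a, b with a ∪ b = e is obtained by inclusion–exclusion, f(a) + f(b) − f(a ∩ b); the helper names are illustrative, not estDec's own API.

```python
from itertools import combinations

def subsets_m1(e):
    """All (m-1)-sized subsets E_{m-1}(e) of itemset e."""
    return [e - {x} for x in e]

def f_max(e, counts):
    """Upper bound, Eq. (9.6): e can be no more frequent than its
    least frequent (m-1)-subset (Apriori principle)."""
    return min(counts[frozenset(a)] for a in subsets_m1(e))

def f_min(e, counts):
    """Lower bound, Eq. (9.7): for distinct (m-1)-subsets a, b we have
    a ∪ b = e, and inclusion-exclusion gives
    f(e) >= f(a) + f(b) - f(a ∩ b)."""
    best = 0
    for a, b in combinations(subsets_m1(e), 2):
        lower = (counts[frozenset(a)] + counts[frozenset(b)]
                 - counts[frozenset(a & b)])
        best = max(best, lower)
    return best

# Example: bound the count of {1, 2, 3} from its subsets' counts.
counts = {frozenset({1}): 10, frozenset({2}): 9, frozenset({3}): 8,
          frozenset({1, 2}): 8, frozenset({1, 3}): 7, frozenset({2, 3}): 6}
e = {1, 2, 3}
hi = f_max(e, counts)  # 6
lo = f_min(e, counts)  # 5
```

The gap between the two bounds is what estDec tracks as the estimation error for an itemset whose exact count was never recorded.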
They maintain an exact count of individual items and then use these bounds to
estimate the count and error for itemsets that were fully recorded because they were