set which is likely to contain almost all the frequent itemsets, but might miss some
of them (with very small probability controlled by δ ).
Finally, while LOSSY COUNTING works for any data stream, the probabilistic
performance guarantee of FPDM only applies if the contents of the transactions are
independent of one another. In real applications, this is often not true. The authors of
FPDM [77] suggest random sampling from a data reservoir to alleviate this problem
[25], at the cost of doubling the memory requirement.
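The reservoir idea can be illustrated with standard reservoir sampling (Vitter's Algorithm R): maintain a fixed-size uniform random sample of the transactions seen so far, and mine from that sample rather than directly from the possibly correlated stream. This is a minimal sketch under that interpretation; the reservoir size and the use of integers as stand-ins for transactions are illustrative, not details of FPDM itself.

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k items from a stream
    of unknown length (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randrange(i + 1)  # uniform index in [0, i]
            if j < k:
                reservoir[j] = item      # keep item with probability k/(i+1)
    return reservoir

# Example: sample 5 "transactions" from a stream of 1000
sample = reservoir_sample(range(1000), 5)
```

Every item in the stream ends up in the sample with equal probability k/n, which is what breaks the dependence between consecutive transactions at the cost of storing the reservoir alongside the mining structures.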
3.2 Recently Frequent Itemsets
We now look at various ways to focus the selection of frequent itemsets on more
recent data: the damped window model, the sliding window model, and the tilted-time
model.
Damped Window Model Chang and Lee [14] study the problem of finding recently
frequent itemsets over data streams using the damped window model. Specifically,
in their model, the weights of previously recorded transactions in the data stream are
periodically reduced by a decay factor d, where 0 < d ≤ 1. In their algorithm, this
decayed weight is used when counting the number of transactions and itemsets received.
For example, a newly arrived transaction has an initial weight of 1. Suppose
n_1 transactions arrive in the first time window and n_2 arrive in the second time window.
At the end of the second window, the weighted count of transactions is
|D|_{t=2} = n_2 + n_1 d. At the end of the third time window, we have
|D|_{t=3} = n_3 + n_2 d + n_1 d^2.
The weights of itemsets are discounted similarly. However, rather than updating the
value of all stored itemsets every time period, a more efficient approach can be to
update a value only when an itemset's weight needs to be read. To implement this
we record both a count and the timestamp of the last update: (f(e), t(e)). When a
new instance of itemset e arrives at time t_k, we update the record as follows:

    f(e) ← f(e) · d^{t_k − t(e)} + 1        (9.5)
    t(e) ← t_k
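The lazy update of Eq. (9.5) can be sketched as follows: each itemset's record stores (f(e), t(e)), and the accumulated decay d^{t_k − t(e)} is applied only when the record is next observed or read. The class name and the chosen decay factor are illustrative, not taken from the original paper.

```python
class DecayedCount:
    """Decayed itemset count with lazy updates: the stored count is
    only brought up to date when the itemset is seen or read."""

    def __init__(self, d):
        self.d = d    # decay factor, 0 < d <= 1
        self.f = 0.0  # decayed count f(e)
        self.t = 0    # time of last update t(e)

    def observe(self, t_k):
        """A new instance of the itemset arrives at time t_k (Eq. 9.5)."""
        self.f = self.f * self.d ** (t_k - self.t) + 1
        self.t = t_k

    def read(self, t_k):
        """Current decayed count at time t_k, without modifying the record."""
        return self.f * self.d ** (t_k - self.t)

c = DecayedCount(d=0.5)
c.observe(1)  # f = 1
c.observe(2)  # f = 1 * 0.5 + 1 = 1.5
c.observe(4)  # f = 1.5 * 0.5**2 + 1 = 1.375
```

Because the decay compounds multiplicatively across windows, applying it once over the elapsed gap t_k − t(e) gives the same result as decaying every period, which is what makes the lazy scheme safe.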
Combining the damped window model with a method for estimating itemset counts,
Chang and Lee produce the estDec algorithm. Their estimation method is similar to
Carma [35]. Let e_m be an itemset containing m items. The Apriori principle tells us that
the count for e_m cannot be greater than the count for any of its (m − 1)-subitemsets.
Let E_{m−1}(e) be the set of all these (m − 1)-sized subsets. This sets an upper bound:

    f(e)_max = min { f(a) | a ∈ E_{m−1}(e) }        (9.6)
Then, noting relationships between unions and intersections of itemsets, they provide
a lower bound:

    f(e)_min = max { f_min(a ∪ b) | a, b ∈ E_{m−1}(e) and a ≠ b }        (9.7)
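A sketch of how the two bounds could be computed from stored subset counts. Here `counts` maps each stored itemset (a frozenset) to its count, and the lower bound for a pair a, b with a ∪ b = e is obtained by inclusion–exclusion, f(a) + f(b) − f(a ∩ b); the helper names are illustrative, not estDec's own API.

```python
from itertools import combinations

def subsets_m1(e):
    """All (m-1)-sized subsets E_{m-1}(e) of itemset e."""
    return [e - {x} for x in e]

def f_max(e, counts):
    """Upper bound, Eq. (9.6): e can be no more frequent than its
    least frequent (m-1)-subset (Apriori principle)."""
    return min(counts[frozenset(a)] for a in subsets_m1(e))

def f_min(e, counts):
    """Lower bound, Eq. (9.7): for distinct (m-1)-subsets a, b we have
    a ∪ b = e, and inclusion-exclusion gives
    f(e) >= f(a) + f(b) - f(a ∩ b)."""
    best = 0
    for a, b in combinations(subsets_m1(e), 2):
        lower = (counts[frozenset(a)] + counts[frozenset(b)]
                 - counts[frozenset(a & b)])
        best = max(best, lower)
    return best

# Example: bound the count of {1, 2, 3} from its subsets' counts.
counts = {frozenset({1}): 10, frozenset({2}): 9, frozenset({3}): 8,
          frozenset({1, 2}): 8, frozenset({1, 3}): 7, frozenset({2, 3}): 6}
e = {1, 2, 3}
hi = f_max(e, counts)  # 6
lo = f_min(e, counts)  # 5
```

The gap between the two bounds is what estDec tracks as the estimation error for an itemset whose exact count was never recorded.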
They maintain an exact count of individual items and then use these bounds to
estimate the count and error for itemsets that were fully recorded because they were