For any i-subset that is already in L, the process will be the same as for a 2-itemset, i.e., we will simply increment the count. However, an i-subset that is not in L will be inserted into L only if all of its (i − 1)-subsets are in L as well. Thus, we use the Apriori property.
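To make the insertion rule concrete, here is a minimal sketch in Python, assuming the lattice L is kept as a dict mapping frozenset itemsets to their counts (the names lattice and update_subset are illustrative, not from Jin and Agrawal's paper):

    from itertools import combinations

    def update_subset(lattice, subset):
        """Increment an i-subset already in the lattice; otherwise insert
        it only if every one of its (i - 1)-subsets is already present."""
        if subset in lattice:
            lattice[subset] += 1      # already tracked: just count it
        elif all(frozenset(s) in lattice
                 for s in combinations(subset, len(subset) - 1)):
            lattice[subset] = 1       # Apriori check passed: insert

    # {'a', 'c'} is missing below, so the 3-subset {'a', 'b', 'c'} is rejected
    lattice = {frozenset('ab'): 2, frozenset('bc'): 1}
    update_subset(lattice, frozenset('abc'))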
After updating i-itemsets in L, we again invoke the REDUCFREQ routine. Thus, the itemsets whose count is only 1 will be deleted from the lattice. This procedure will continue until there are no frequent k-itemsets in L. At the end of this, we clear the buffer and start processing new transactions in the stream. This will restart the first phase of our algorithm to deal with 2-itemsets.
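The pruning pass can be sketched just as briefly, again over the assumed dict-based lattice; REDUCFREQ is reduced here to the effect described above, deleting itemsets whose count is only 1:

    def reduc_freq(lattice):
        """Delete every itemset whose count is only 1."""
        for itemset in [s for s, c in lattice.items() if c == 1]:
            del lattice[itemset]

    lattice = {frozenset('a'): 4, frozenset('b'): 1, frozenset('ab'): 1}
    reduc_freq(lattice)
    print(lattice)    # {frozenset({'a'}): 4} -- only counts above 1 survive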
In their experimental results, Jin and Agrawal show that STREAMMINING is more
memory efficient than LOSSY COUNTING. On the T10.I4.N10K dataset used in Manku
and Motwani's paper, with 1 million transactions and a support level of 1 %, LOSSY
COUNTING requires an out-of-core data structure on top of a 44 MB buffer. On the
other hand, for larger datasets ranging from 4 million to 20 million transactions, Jin
and Agrawal's algorithm only requires 2.5 MB in main memory.
Tilted Time Window Model Giannella et al. [28] employ a tilted time window model to record a compressed history of the entire data stream's frequent patterns. Its unique windowing structure allows it to approximately answer queries about frequent itemsets for a time window from any t1 to t2. The window size of the response is within 50 % of the size requested by the user.
The most distinctive feature is the set of time windows of growing width. Data are processed and recorded in batches of uniform size, but as time progresses, older records are merged together. This leverages the fact that users often want to know about recent history with fine granularity but are satisfied with a coarser granularity for older time periods. Because the window sizes grow exponentially, the number of windows grows only logarithmically, thus making the memory requirements tractable.
The candidate frequent patterns for each data batch are discovered using an FP-tree with error factor ε, like the one in LOSSY COUNTING [61]. These candidate patterns and their counts are summarized in a table; f(i, j) denotes the table of frequent patterns for the time interval [ti : tj]. As batches become older, they are merged with adjacent batches into windows whose sizes are powers of 2. If the total number of elapsed batches is 2^n for some integer n, then there will be n + 1 windows of sizes 1, 1, 2, 4, 8, etc. When the total number of batches is not such a perfect power of 2, there may temporarily be up to three windows covering just a single batch and up to two windows of each larger size. For example, if 16 batches have been processed, then the windows would be sized as follows:

f(0, 7), f(8, 11), f(12, 13), f(14), f(15)

However, if only 15 batches have been processed, then these would be the sizes:

f(0, 3), f(4, 7), f(8, 9), f(10, 11), f(12), f(13), f(14)
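A small simulation reproduces these layouts. The merge rule coded below (once a fourth single-batch window, or a third window of any larger size, appears, merge the two oldest windows of that size) is inferred from the two examples above rather than quoted from Giannella et al., but it yields exactly the layouts shown:

    def tilted_windows(num_batches):
        sizes = []                           # window sizes, oldest first
        for _ in range(num_batches):
            sizes.append(1)                  # each new batch is its own window
            size, limit = 1, 3               # up to three single-batch windows
            while sizes.count(size) > limit:
                i = sizes.index(size)        # oldest window of this size
                sizes[i:i + 2] = [size * 2]  # merge the two oldest into one
                size, limit = size * 2, 2    # up to two windows per larger size
        out, start = [], 0                   # render as f(start, end) intervals
        for s in sizes:
            end = start + s - 1
            out.append(f"f({start})" if s == 1 else f"f({start}, {end})")
            start = end + 1
        return ", ".join(out)

    print(tilted_windows(16))  # f(0, 7), f(8, 11), f(12, 13), f(14), f(15)
    print(tilted_windows(15))  # f(0, 3), f(4, 7), f(8, 9), f(10, 11), f(12), f(13), f(14)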
Similar to how LOSSY COUNTING will delete an itemset if its support level falls below ε, the tilted time window algorithm (FP-STREAMING) also drops infrequent itemsets.