Frequent Pattern Mining in Data Streams - Frequent Pattern Mining

Database Reference

In-Depth Information

frequent items. These candidate sequences are used to construct a PLWAP Tree for

the current batch, which in turn identifies actual frequent patterns for insertion in the

FSP Tree.

Mendes et al. [ 64 ] develop two remarkably straightforward algorithms: SS-BE

based on LOSSY COUNTING [ 61 ] and SS-MB based on SPACE-SAVING [ 65 ]. Like

LOSSY COUNTING, they operate in the batch-processed landmark window domain.

Recently, Koper and Nguyen [ 43 ] have improved the SS-BE algorithm, via changes

to the pruning and other criteria. Chang et al. [ 16 ] have developed a method for

discovering frequent closed sequences.

4.2

Subtrees and Semistructured Data

Subtrees are an interesting pattern because on one hand they are a special case of

subgraphs and on the other hand they are the typical abstraction of semistructured

data like XML. Mining frequent trees has many important applications, such text

retrieval, web analysis, computer vision, bio- and chemical- informatics. In [ 6 ],

the authors model semi-structured data and patterns as labeled ordered trees. They

present an algorithm STREAMT which receives fragments of semi-structured data

from a document of unknown total length through a data stream, which then returns

the current set of frequent patterns upon request. This algorithm follows the damped

window model, in which older items have exponentially less importance.

The data stream is assumed to be a sequence of labeled elements, each label l

originating from a finite alphabet ( L )

1 1 ,1 2 , ... , l k ]. Each element may have

nested within it a sequence of additional labeled elements. Each nested element may

have its own nested elements. XML fits this model nicely. It is clear that this structure

maps to a forest of rooted trees which are traversed depth first. Each root in the forest

corresponds to the top-level element in the data stream. For convenience, The forest

is treated as a single tree T , in which the separate roots are each a child of a master

root.

A pattern P is also a tree; every subtree of T is a pattern. Two patterns match

if they are graph-isomorphic with matching node labels. Imagine a tree containing

N nodes, f ( L ) of which are labeled L . The pattern consisting of the single node L

thus has support level F ( L ) /N . If half of the L nodes have a child with label M ,

then the two-node pattern [ L

={

childM ] has support level F ( L ) / 2 N . When the tree

becomes too large, infrequent branches may be pruned. The tree is exact up until the

time of the first pruning. Because pruning removes some history, we may have false

negative errors after that point.

Another work which seeks frequent trees from a XML data stream is [ 54 ]. Each

transaction is a query in the form of an XML tree. We seek frequent common subtrees.

Each query tree is converted to a standard linear format via a depth-first search. It

is then added to a forest which records the accumulated query trees. The algorithm

follows the landmark model, so the forest summarizes the full history; however,

infrequent subtrees of the forest are periodically pruned.

+

Frequent Pattern Mining

Search WWH ::

Custom Search

Home