Database Reference
In-Depth Information
frequent items. These candidate sequences are used to construct a PLWAP Tree for
the current batch, which in turn identifies actual frequent patterns for insertion in the
FSP Tree.
Mendes et al. [ 64 ] develop two remarkably straightforward algorithms: SS-BE
based on LOSSY COUNTING [ 61 ] and SS-MB based on SPACE-SAVING [ 65 ]. Like
LOSSY COUNTING, they operate in the batch-processed landmark window domain.
Recently, Koper and Nguyen [ 43 ] have improved the SS-BE algorithm, via changes
to the pruning and other criteria. Chang et al. [ 16 ] have developed a method for
discovering frequent closed sequences.
4.2
Subtrees and Semistructured Data
Subtrees are an interesting pattern because on one hand they are a special case of
subgraphs and on the other hand they are the typical abstraction of semistructured
data like XML. Mining frequent trees has many important applications, such text
retrieval, web analysis, computer vision, bio- and chemical- informatics. In [ 6 ],
the authors model semi-structured data and patterns as labeled ordered trees. They
present an algorithm STREAMT which receives fragments of semi-structured data
from a document of unknown total length through a data stream, which then returns
the current set of frequent patterns upon request. This algorithm follows the damped
window model, in which older items have exponentially less importance.
The data stream is assumed to be a sequence of labeled elements, each label l
originating from a finite alphabet ( L )
1 1 ,1 2 , ... , l k ]. Each element may have
nested within it a sequence of additional labeled elements. Each nested element may
have its own nested elements. XML fits this model nicely. It is clear that this structure
maps to a forest of rooted trees which are traversed depth first. Each root in the forest
corresponds to the top-level element in the data stream. For convenience, The forest
is treated as a single tree T , in which the separate roots are each a child of a master
root.
A pattern P is also a tree; every subtree of T is a pattern. Two patterns match
if they are graph-isomorphic with matching node labels. Imagine a tree containing
N nodes, f ( L ) of which are labeled L . The pattern consisting of the single node L
thus has support level F ( L ) /N . If half of the L nodes have a child with label M ,
then the two-node pattern [ L
={
childM ] has support level F ( L ) / 2 N . When the tree
becomes too large, infrequent branches may be pruned. The tree is exact up until the
time of the first pruning. Because pruning removes some history, we may have false
negative errors after that point.
Another work which seeks frequent trees from a XML data stream is [ 54 ]. Each
transaction is a query in the form of an XML tree. We seek frequent common subtrees.
Each query tree is converted to a standard linear format via a depth-first search. It
is then added to a forest which records the accumulated query trees. The algorithm
follows the landmark model, so the forest summarizes the full history; however,
infrequent subtrees of the forest are periodically pruned.
+
Search WWH ::




Custom Search