Pattern-Growth Methods - Frequent Pattern Mining

Database Reference

In-Depth Information

referring to the sequences in the database as a pseudo-projection . Every projection

consists of two pieces of information: pointer to the sequence in database and offset

of the postfix in the sequence. This avoids physically copying postfixes. Thus, it

is efficient in terms of both running time and space. However, it is not efficient if

the pseudo-projection is used for disk-based accessing since random access of disk

space is very costly. Therefore, when the sequence database cannot be held in main

memory, a bi-level projection method is explored, which projects databases not at

every level but at every two levels. In comparison with level-by-level projection ,

bi-level projection reduces the cost of database projection and leads to improved

performance when the database is huge and the support threshold is low.

A systematic performance study in [ 30 ] shows that PrefixSpan with these two

optimizations is efficient and scalable. It mines the complete set of patterns and runs

considerably faster than both Apriori-based GSP algorithm [ 35 ] and FreeSpan [ 17 ].

5

Further Development of Pattern Growth-Based Pattern

Mining Methodology

Since the proposal of the pattern-growth approach for mining frequent patterns [ 18 ], a

lot of research has been conducted that extends this methodology in multiple frontiers,

including further enhance pattern-growth efficiency at mining frequent patterns (such

as [ 14 , 31 ]), mining structured patterns by pattern growth [ 37 , 38 ], mining colossal

patterns [ 39 ], mining approximate patterns [ 9 ], mining multi-dimensional patterns

[ 33 ], pattern-based clustering [ 32 ], and pattern-based classification [ 10 , 11 ].

First, the pattern growth approach has been extended to mine frequent substruc-

tures such as frequent subgraph patterns. Graph is a general data structure at modeling

sophisticated interconnected data objects, with broad applications including chemi-

cal informatics, bioinformatics, computer vision, video indexing, text retrieval, and

Web analysis. Among the various kinds of graph patterns, frequent substructures are

the very basic patterns that can be discovered in a collection of graphs. The pattern-

growth mining algorithms, represented by gSpan [ 37 ] for mining frequent subgraph

patterns, extends a frequent graph by adding a new edge, in every possible position.

A potential problem with the edge extension is that the same graph can be discovered

many times. The gSpan algorithm solves this problem by introducing a right-most

extension technique, where the extensions will only take place on the right-most

path. A right-most path is the straight path from the starting vertex v 0 to the last

vertex v n , according to a depth-first search on the graph. With such an extension, the

pattern-growth approach can be extended to mining frequent subgraph patterns with

high efficiency. Further, it is also more desirable to mine closed subgraph patterns

directly than first mine frequent graph patterns and then filter out those subgraphs

that share the same frequency support as their super-graphs. CloseGraph [ 38 ]isone

such pattern graph algorithm that checks the size of the project graph datasets to

prune those paths that cannot generate new closed subgraph patterns.

Search WWH ::

Custom Search

Home