Database Reference
In-Depth Information
referring to the sequences in the database as a pseudo-projection . Every projection
consists of two pieces of information: pointer to the sequence in database and offset
of the postfix in the sequence. This avoids physically copying postfixes. Thus, it
is efficient in terms of both running time and space. However, it is not efficient if
the pseudo-projection is used for disk-based accessing since random access of disk
space is very costly. Therefore, when the sequence database cannot be held in main
memory, a bi-level projection method is explored, which projects databases not at
every level but at every two levels. In comparison with level-by-level projection ,
bi-level projection reduces the cost of database projection and leads to improved
performance when the database is huge and the support threshold is low.
A systematic performance study in [ 30 ] shows that PrefixSpan with these two
optimizations is efficient and scalable. It mines the complete set of patterns and runs
considerably faster than both Apriori-based GSP algorithm [ 35 ] and FreeSpan [ 17 ].
5
Further Development of Pattern Growth-Based Pattern
Mining Methodology
Since the proposal of the pattern-growth approach for mining frequent patterns [ 18 ], a
lot of research has been conducted that extends this methodology in multiple frontiers,
including further enhance pattern-growth efficiency at mining frequent patterns (such
as [ 14 , 31 ]), mining structured patterns by pattern growth [ 37 , 38 ], mining colossal
patterns [ 39 ], mining approximate patterns [ 9 ], mining multi-dimensional patterns
[ 33 ], pattern-based clustering [ 32 ], and pattern-based classification [ 10 , 11 ].
First, the pattern growth approach has been extended to mine frequent substruc-
tures such as frequent subgraph patterns. Graph is a general data structure at modeling
sophisticated interconnected data objects, with broad applications including chemi-
cal informatics, bioinformatics, computer vision, video indexing, text retrieval, and
Web analysis. Among the various kinds of graph patterns, frequent substructures are
the very basic patterns that can be discovered in a collection of graphs. The pattern-
growth mining algorithms, represented by gSpan [ 37 ] for mining frequent subgraph
patterns, extends a frequent graph by adding a new edge, in every possible position.
A potential problem with the edge extension is that the same graph can be discovered
many times. The gSpan algorithm solves this problem by introducing a right-most
extension technique, where the extensions will only take place on the right-most
path. A right-most path is the straight path from the starting vertex v 0 to the last
vertex v n , according to a depth-first search on the graph. With such an extension, the
pattern-growth approach can be extended to mining frequent subgraph patterns with
high efficiency. Further, it is also more desirable to mine closed subgraph patterns
directly than first mine frequent graph patterns and then filter out those subgraphs
that share the same frequency support as their super-graphs. CloseGraph [ 38 ]isone
such pattern graph algorithm that checks the size of the project graph datasets to
prune those paths that cannot generate new closed subgraph patterns.
Search WWH ::




Custom Search