Databases Reference
In-Depth Information
(e.g., 10,000 to 100,000) but only a small number of samples (e.g., 100 to 1000). The
other direction develops a new mining methodology, called Pattern-Fusion , which mines
colossal patterns , that is, patterns of very long length.
Let's first briefly examine the first direction, in particular, a pattern growth-based row
enumeration approach. Its general philosophy is to explore the vertical data format , as
described in Section 6.2.5, which is also known as row enumeration . Row enumeration
differs from traditional column (i.e., item) enumeration (also known as the horizon-
tal data format ). In traditional column enumeration, the data set, D , is viewed as a
set of rows, where each row consists of an itemset. In row enumeration, the data set
is instead viewed as an itemset, each consisting of a set of row ID s indicating where the
item appears in the traditional view of D . The original data set, D , can easily be trans-
formed into a transposed data set, T . A data set with a small number of rows but a large
number of dimensions is then transformed into a transposed data set with a large num-
ber of rows but a small number of dimensions. Efficient pattern growth methods can
then be developed on such relatively low-dimensional data sets. The details of such an
approach are left as an exercise for interested readers.
The remainder of this section focuses on the second direction. We introduce Pattern-
Fusion, a new mining methodology that mines colossal patterns (i.e., patterns of very
long length). This method takes leaps in the pattern search space, leading to a good
approximation of the complete set of colossal frequent patterns.
7.4.1 Mining Colossal Patterns by Pattern-Fusion
Although we have studied methods for mining frequent patterns in various situations,
many applications have hidden patterns that are tough to mine, due mainly to their
immense length or size. Consider bioinformatics, for example, where a common activ-
ity is DNA or microarray data analysis. This involves mapping and analyzing very long
DNA and protein sequences. Researchers are more interested in finding large patterns
(e.g., long sequences) than finding small ones since larger patterns usually carry more
significant meaning. We call these large patterns colossal patterns, as distinguished from
patterns with large support sets. Finding colossal patterns is challenging because incre-
mental mining tends to get “trapped” by an explosive number of midsize patterns before
it can even reach candidate patterns of large size. This is illustrated in Example 7.10.
Example 7.10 The challenge of mining colossal patterns. Consider a 4040 square table where each
row contains the integers 1 through 40 in increasing order. Remove the integers on the
diagonal, and this gives a 4039 table. Add 20 identical rows to the bottom of the
table, where each row contains the integers 41 through 79 in increasing order, result-
ing in a 6039 table (Figure 7.6). We consider each row as a transaction and set the
minimum support threshold at 20. The table has an exponential number (i.e., 4 20
)
of midsize closed/maximal frequent patterns of size 20, but only one that is colossal:
D.
of size 39. None of the frequent pattern mining algorithms that
we have introduced so far can complete execution in a reasonable amount of time.
41, 42,
:::
, 79
/
 
Search WWH ::




Custom Search