Advanced Pattern Mining - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

(e.g., 10,000 to 100,000) but only a small number of samples (e.g., 100 to 1000). The

other direction develops a new mining methodology, called Pattern-Fusion , which mines

colossal patterns , that is, patterns of very long length.

Let's first briefly examine the first direction, in particular, a pattern growth-based row

enumeration approach. Its general philosophy is to explore the vertical data format , as

described in Section 6.2.5, which is also known as row enumeration . Row enumeration

differs from traditional column (i.e., item) enumeration (also known as the horizon-

tal data format ). In traditional column enumeration, the data set, D , is viewed as a

set of rows, where each row consists of an itemset. In row enumeration, the data set

is instead viewed as an itemset, each consisting of a set of row ID s indicating where the

item appears in the traditional view of D . The original data set, D , can easily be trans-

formed into a transposed data set, T . A data set with a small number of rows but a large

number of dimensions is then transformed into a transposed data set with a large num-

ber of rows but a small number of dimensions. Efficient pattern growth methods can

then be developed on such relatively low-dimensional data sets. The details of such an

approach are left as an exercise for interested readers.

The remainder of this section focuses on the second direction. We introduce Pattern-

Fusion, a new mining methodology that mines colossal patterns (i.e., patterns of very

long length). This method takes leaps in the pattern search space, leading to a good

approximation of the complete set of colossal frequent patterns.

7.4.1 Mining Colossal Patterns by Pattern-Fusion

Although we have studied methods for mining frequent patterns in various situations,

many applications have hidden patterns that are tough to mine, due mainly to their

immense length or size. Consider bioinformatics, for example, where a common activ-

ity is DNA or microarray data analysis. This involves mapping and analyzing very long

DNA and protein sequences. Researchers are more interested in finding large patterns

(e.g., long sequences) than finding small ones since larger patterns usually carry more

significant meaning. We call these large patterns colossal patterns, as distinguished from

patterns with large support sets. Finding colossal patterns is challenging because incre-

mental mining tends to get “trapped” by an explosive number of midsize patterns before

it can even reach candidate patterns of large size. This is illustrated in Example 7.10.

Example 7.10 The challenge of mining colossal patterns. Consider a 4040 square table where each

row contains the integers 1 through 40 in increasing order. Remove the integers on the

diagonal, and this gives a 4039 table. Add 20 identical rows to the bottom of the

table, where each row contains the integers 41 through 79 in increasing order, result-

ing in a 6039 table (Figure 7.6). We consider each row as a transaction and set the

minimum support threshold at 20. The table has an exponential number (i.e., 4 20

)

of midsize closed/maximal frequent patterns of size 20, but only one that is colossal:

D.

of size 39. None of the frequent pattern mining algorithms that

we have introduced so far can complete execution in a reasonable amount of time.

41, 42,

:::

, 79

/

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home