Advanced Pattern Mining - Data Mining: Concepts and Techniques


where $P(x=1, y=1) = \frac{|D_\alpha \cap D_\beta|}{|D|}$, $P(x=0, y=1) = \frac{|D_\beta| - |D_\alpha \cap D_\beta|}{|D|}$, $P(x=1, y=0) = \frac{|D_\alpha| - |D_\alpha \cap D_\beta|}{|D|}$, and $P(x=0, y=0) = \frac{|D| - |D_\alpha \cup D_\beta|}{|D|}$. Standard Laplace smoothing can be used to avoid zero probability.
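As a rough illustration, the four joint probabilities and the resulting mutual information can be computed from simple transaction counts. The sketch below is not taken from the text: the function name `mutual_information`, its parameters, the add-one form of Laplace smoothing (one pseudo-count per joint outcome), and the base-2 logarithm are all assumptions made for the example.

```python
import math


def mutual_information(n_total, n_a, n_b, n_ab, smoothing=1.0):
    """Mutual information between the occurrences of two patterns.

    n_total: number of transactions in the database
    n_a, n_b: number of transactions containing each pattern
    n_ab: number of transactions containing both patterns
    smoothing: pseudo-count added to each joint outcome (assumed form
               of Laplace smoothing; the text does not spell it out)
    """
    # Raw counts for the four joint outcomes (x, y), where x and y
    # indicate whether the first and second pattern occur.
    counts = {
        (1, 1): n_ab,
        (0, 1): n_b - n_ab,
        (1, 0): n_a - n_ab,
        (0, 0): n_total - (n_a + n_b - n_ab),  # neither pattern occurs
    }

    # Smoothed joint probabilities; they still sum to 1.
    denom = n_total + 4 * smoothing
    joint = {xy: (c + smoothing) / denom for xy, c in counts.items()}

    # Marginals obtained by summing the joint over the other variable.
    p_x = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
    p_y = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}

    # Sum of P(x, y) * log(P(x, y) / (P(x) P(y))) over all outcomes.
    return sum(
        p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items()
    )
```

With smoothing, every joint probability is strictly positive, so the logarithm is always defined. Two patterns that occur independently score close to zero, while two patterns that always occur together score high.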

Mutual information favors strongly correlated units and thus can be used to model

the indicative strength of the context units selected. With context modeling, pattern

annotation can be accomplished as follows:

1. To extract the most significant context indicators, we can use cosine similarity

(Chapter 2) to measure the semantic similarity between pairs of context vectors, rank

the context indicators by the weight strength, and extract the strongest ones.

2. To extract representative transactions, represent each transaction as a context vector.

Rank the transactions with semantic similarity to the pattern p .

3. To extract semantically similar patterns, rank every other frequent pattern by the semantic similarity between its context model and the context model of p .
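The ranking in steps 2 and 3 can be sketched as follows, assuming each context model is stored as a sparse vector that maps a context unit to its weight (for example, a mutual information score). The function names, the dictionary representation, and the `top_k` parameter are illustrative assumptions, not part of the method as described.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse context vectors.

    Each vector is a dict mapping a context unit to its weight.
    """
    dot = sum(w * v[unit] for unit, w in u.items() if unit in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty context is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_by_similarity(target, candidates, top_k=None):
    """Rank candidates (name -> context vector) against a target vector.

    Returns (name, score) pairs, most similar first. The candidates can
    be transactions (step 2) or other frequent patterns (step 3).
    """
    scored = [
        (name, cosine_similarity(target, vec))
        for name, vec in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Because the score depends only on the context vectors, a candidate can rank highly even if it never co-occurs with the target pattern in any transaction.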

Based on these principles, experiments have been conducted on large data sets to

generate semantic annotations. Example 7.16 illustrates one such experiment.

Example 7.16 Semantic annotations generated for frequent patterns from the DBLP Computer Science Bibliography. Table 7.4 shows annotations generated for frequent patterns from a portion of the DBLP data set. The DBLP data set contains papers from the proceedings of 12 major conferences in the fields of database systems, information retrieval, and data mining. Each transaction consists of two parts: the authors and the title of the corresponding paper.

Consider two types of patterns: (1) frequent author or coauthorship, each of which is a frequent itemset of authors, and (2) frequent title terms, each of which is a frequent sequential pattern of the title words. The method can automatically generate dictionary-like annotations for different kinds of frequent patterns. For frequent itemsets like coauthorship or single authors, the strongest context indicators are usually the other coauthors and discriminative title terms that appear in their work. The semantically similar patterns extracted also reflect the authors and terms related to their work. However, these similar patterns may not even co-occur with the given pattern in a paper. For example, the patterns "timos k selli," "ramakrishnan srikant," and so on, do not co-occur with the pattern "christos faloutsos," but are extracted because their contexts are similar, since they all are database and/or data mining researchers; thus the annotation is meaningful.

For the title term "information retrieval," which is a sequential pattern, its strongest context indicators are usually the authors who tend to use the term in the titles of their papers, or the terms that tend to coappear with it. Its semantically similar patterns usually provide interesting concepts or descriptive terms, which are close in meaning (e.g., "information retrieval → information filter").
