where
$P(x=1, y=1) = \frac{|D_\alpha \cap D_\beta|}{|D|}$,
$P(x=0, y=1) = \frac{|D_\beta| - |D_\alpha \cap D_\beta|}{|D|}$,
$P(x=1, y=0) = \frac{|D_\alpha| - |D_\alpha \cap D_\beta|}{|D|}$, and
$P(x=0, y=0) = \frac{|D| - |D_\alpha \cup D_\beta|}{|D|}$.
Standard Laplace smoothing can be used to avoid zero probability.
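As a rough illustration of the four joint probabilities and the smoothing step, the following Python sketch computes mutual information from transaction counts. The function name, the argument names (n_total for |D|, n_a and n_b for the number of transactions containing each of the two patterns, n_ab for the number containing both), the add-one pseudo-count applied to each of the four cells, and the base-2 logarithm are all assumptions made for this example; the text does not fix them.

```python
import math


def mutual_information(n_total, n_a, n_b, n_ab, smoothing=1.0):
    """Mutual information (in bits) between the presence of two patterns.

    Hypothetical helper: n_total = |D|, n_a and n_b = transactions that
    contain each pattern, n_ab = transactions that contain both.
    """
    # Joint counts for (x, y): x = 1 if the first pattern is present,
    # y = 1 if the second pattern is present.
    counts = {
        (1, 1): n_ab,
        (0, 1): n_b - n_ab,
        (1, 0): n_a - n_ab,
        (0, 0): n_total - (n_a + n_b - n_ab),  # |D| minus the union
    }
    # Laplace smoothing (assumed form): add a pseudo-count to each cell
    # so that no joint probability is exactly zero.
    denom = n_total + 4 * smoothing
    joint = {cell: (c + smoothing) / denom for cell, c in counts.items()}

    # Marginals obtained by summing the smoothed joint distribution.
    p_x = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
    p_y = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}

    return sum(
        p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items()
    )
```

With these assumptions, two patterns that occur independently score close to zero, while two patterns that always occur together score much higher.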
Mutual information favors strongly correlated units and thus can be used to model
the indicative strength of the context units selected. With context modeling, pattern
annotation can be accomplished as follows:
1. To extract the most significant context indicators, we can use cosine similarity
(Chapter 2) to measure the semantic similarity between pairs of context vectors, rank
the context indicators by the weight strength, and extract the strongest ones.
2. To extract representative transactions, represent each transaction as a context vector.
Rank the transactions by their semantic similarity to the pattern p.
3. To extract semantically similar patterns, rank every other frequent pattern by the
semantic similarity between its context model and the context of p.
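The three steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: context vectors are assumed to be sparse dictionaries mapping a context unit to its weight, and the function names and the toy data (unit names u1 to u4, candidates q1 to q3) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty context is treated as similar to nothing
    return dot / (norm_u * norm_v)


def top_indicators(context, k):
    """Step 1: the k context units with the largest weights."""
    return sorted(context, key=context.get, reverse=True)[:k]


def rank_by_similarity(target, candidates):
    """Steps 2 and 3: rank candidate context vectors (transactions or
    other patterns) by cosine similarity to the target pattern's context."""
    scored = [(name, cosine(target, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): context of a pattern p and three candidates.
context_p = {"u1": 0.9, "u2": 0.4, "u3": 0.1}
candidates = {
    "q1": {"u1": 0.8, "u2": 0.5},
    "q2": {"u3": 1.0},
    "q3": {"u4": 0.7},
}
strongest = top_indicators(context_p, 2)
ranking = rank_by_similarity(context_p, candidates)
```

The same ranking function serves both step 2 and step 3 because, under this representation, a transaction and a pattern are each just a vector over the same context units. Note that a candidate can rank highly without ever co-occurring with p, since only the contexts are compared.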
Based on these principles, experiments have been conducted on large data sets to
generate semantic annotations. Example 7.16 illustrates one such experiment.
Example 7.16 Semantic annotations generated for frequent patterns from the DBLP Computer
Science Bibliography. Table 7.4 shows annotations generated for frequent patterns from a
portion of the DBLP data set.³ The DBLP data set contains papers from the proceedings
of 12 major conferences in the fields of database systems, information retrieval,
and data mining. Each transaction consists of two parts: the authors and the title of the
corresponding paper.
Consider two types of patterns: (1) frequent author or coauthorship, each of which
is a frequent itemset of authors, and (2) frequent title terms, each of which is a
frequent sequential pattern of the title words. The method can automatically generate
dictionary-like annotations for different kinds of frequent patterns. For frequent
itemsets like coauthorship or single authors, the strongest context indicators are usually the
other coauthors and discriminative title terms that appear in their work. The semantically
similar patterns extracted also reflect the authors and terms related to their work.
However, these similar patterns may not even co-occur with the given pattern in a paper.
For example, the patterns “timos k selli,” “ramakrishnan srikant,” and so on, do not
co-occur with the pattern “christos faloutsos,” but are extracted because their contexts are
similar, since they are all database and/or data mining researchers; thus the annotation
is meaningful.
For the title term “information retrieval,” which is a sequential pattern, its strongest
context indicators are usually the authors who tend to use the term in the titles of their
papers, or the terms that tend to coappear with it. Its semantically similar patterns
usually provide interesting concepts or descriptive terms, which are close in meaning (e.g.,
“information retrieval” → “information filter”).
 