Databases Reference
In-Depth Information
of probabilities. A simple yet sophisticated weighting scheme, and the one most commonly used in information retrieval and information extraction, is TFIDF indexing, i.e., tf × idf indexing [20, 21], where tf denotes the term frequency, the number of times a term appears in a document, and idf denotes the inverse document frequency, where the document frequency is the number of documents that contain the term. Its effect is to give a commonly used word a relatively small tf × idf value. Moffat and Zobel [17] pointed out that the tf × idf function demonstrates that: (1) rare terms are no less important than frequent terms according to their idf values; (2) multiple appearances of a term in a document are no less important than single appearances according to their tf values. The tf × idf value implies the significance of a term in a document, which can be defined as follows.
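As an illustration of the weighting described above, the following is a minimal sketch of one common tf × idf variant: raw term count for tf and log(N / df) for idf, with the natural logarithm. The function name `tf_idf`, the token-list representation of documents, and the toy collection are assumptions made for this example, and the exact variant used in [20, 21] may differ.

```python
import math
from collections import Counter


def tf_idf(term, doc, docs):
    """tf x idf of `term` in `doc`, where `docs` is the whole collection.

    Each document is a list of tokens. Assumed variant: raw count for tf
    and log(N / df) for idf; other weightings exist.
    """
    tf = Counter(doc)[term]                  # occurrences of the term in this document
    df = sum(1 for d in docs if term in d)   # documents that contain the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


# Hypothetical toy collection, for illustration only.
docs = [
    ["the", "data", "mining", "the"],
    ["the", "web", "pages"],
    ["the", "data", "base"],
]
common = tf_idf("the", docs[0], docs)    # "the" occurs in every document
rare = tf_idf("mining", docs[0], docs)   # "mining" occurs in one document only
```

In this toy collection the word that appears in every document receives a weight of zero even though it occurs twice in the first document, while the rarer word receives a positive weight.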
We observed that the direction of the associations among key terms (including compound words) is irrelevant for the purpose of document clustering, so we ignore the confidence and consider only the support. In other words, we consider the structure of the undirected associations of key terms; we believe that the set of key terms that co-occur reflects the essential information and that the rule directions among the key terms are inessential, at least at the present stage of investigation. Let $t_A$ and $t_B$ be two terms. The support is defined for a collection of documents as follows.
Definition 1. The significance of the undirected association of term $t_A$ and term $t_B$ in a collection is:
\[
\mathrm{significance}(t_A, t_B, T_r) = \frac{1}{|T_r|} \sum_{i=0}^{|T_r|} \mathrm{significance}(t_A, t_B, d_i)
\]
where
\[
\mathrm{significance}(t_A, t_B, d_i) = \mathrm{tf}(t_A, t_B, d_i) \, \log \frac{|T_r|}{|T_r(t_A, t_B)|},
\]
$|T_r(t_A, t_B)|$ is the number of documents that contain both term $t_A$ and term $t_B$, and $|T_r|$ denotes the number of Web pages in the collection.
The term frequency $\mathrm{tf}(t_A, t_B, d_i)$ of terms $t_A$ and $t_B$ together can be calculated as follows.
Definition 2.
\[
\mathrm{tf}(t_A, t_B, d_j) =
\begin{cases}
1 + \log\bigl(\min\{N(t_A, d_j),\, N(t_B, d_j)\}\bigr) & \text{if } N(t_A, d_j) > 0 \text{ and } N(t_B, d_j) > 0, \\
0 & \text{otherwise.}
\end{cases}
\]
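Definitions 1 and 2 can be sketched in a few lines of Python. This is only an illustrative reading under stated assumptions: N(t, d) is taken to be the number of occurrences of term t in document d, the logarithm is natural, documents are token lists, and the sum runs once over each document in the collection. The names `pair_tf` and `pair_significance` and the toy collection are hypothetical.

```python
import math


def pair_tf(t_a, t_b, doc):
    """Definition 2: 1 + log(min(N(t_a, d), N(t_b, d))) if both terms occur, else 0."""
    n_a, n_b = doc.count(t_a), doc.count(t_b)   # assumed meaning of N(t, d)
    if n_a > 0 and n_b > 0:
        return 1.0 + math.log(min(n_a, n_b))
    return 0.0


def pair_significance(t_a, t_b, docs):
    """Definition 1: mean over documents of tf * log(|Tr| / |Tr(t_a, t_b)|)."""
    n_docs = len(docs)
    n_both = sum(1 for d in docs if t_a in d and t_b in d)
    if n_docs == 0 or n_both == 0:
        return 0.0   # the pair never co-occurs, so every tf term is zero
    idf = math.log(n_docs / n_both)
    return sum(pair_tf(t_a, t_b, d) * idf for d in docs) / n_docs


# Hypothetical toy collection, for illustration only.
docs = [
    ["data", "mining", "data", "mining"],
    ["data", "base"],
    ["web", "mining"],
    ["web", "page"],
]
sig = pair_significance("data", "mining", docs)
```

Because the definition uses the minimum of the two counts and a co-occurrence test, swapping the two terms gives the same value, which matches the undirected nature of the association.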
A minimal threshold θ is imposed to filter out the terms whose significance values are small. It helps us to eliminate the most common terms in a collection and the nonspecific terms in a document.
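The thresholding step might look like the sketch below. The function name `filter_by_threshold`, the use of a `frozenset` to represent an undirected pair of terms, the choice of "greater than or equal to θ" as the keep condition, and all the scores shown are assumptions made for illustration; the text does not specify how θ is chosen.

```python
def filter_by_threshold(significance, theta):
    """Keep only the associations whose significance is at least theta."""
    return {pair: s for pair, s in significance.items() if s >= theta}


# Hypothetical significance values, keyed by undirected term pairs.
scores = {
    frozenset({"data", "mining"}): 0.59,
    frozenset({"the", "data"}): 0.0,
    frozenset({"web", "page"}): 0.35,
}
kept = filter_by_threshold(scores, theta=0.1)
```

Here the pair involving a very common word falls below θ and is dropped, while the two more specific pairs are kept.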
 