text, or of its frequency in the text. Clearly, the main difficulty is the
dimension of that vector, which is, in principle, equal to the number of words
in the vocabulary. Nevertheless, not all words are equally discriminant: the
most frequent words (of, the, and) are not useful for discrimination, nor are
very rare words. Therefore, the first step in the design of a filter is the
determination of the vocabulary that is specific to the topic.
Word Encoding
The words are encoded in the following way: we denote by FT(m, t) the
frequency of occurrence of word m in text t, and by FT(t) the average frequency
of the terms in text t. Then the word m is described by [Singhal 1996]

\[ x(m) = \frac{1 + \log \mathrm{FT}(m, t)}{1 + \log \mathrm{FT}(t)}. \]
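The encoding above can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not state: FT(m, t) is taken to be the raw count of word m in text t, FT(t) is the mean count over the distinct words of t, the logarithm is natural, and tokens are obtained by splitting on whitespace. The function name encode_words is hypothetical.

```python
import math
from collections import Counter

def encode_words(tokens):
    """Return {word: x(word)} for one text, given its list of tokens."""
    counts = Counter(tokens)                 # FT(m, t): assumed to be the raw count
    if not counts:
        return {}
    # FT(t): assumed to be the mean count over the distinct words of the text
    mean_count = sum(counts.values()) / len(counts)
    denom = 1.0 + math.log(mean_count)       # mean_count >= 1, so denom >= 1
    return {word: (1.0 + math.log(c)) / denom for word, c in counts.items()}

tokens = "oil exploration near the falkland islands oil licence oil".split()
x = encode_words(tokens)
```

With this normalization, a word that occurs once always has numerator 1, so repeated words such as "oil" receive a larger x than words occurring once in the same text.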
Zipf's Law
Zipf's law [Zipf 1949] is helpful for finding discriminant words: given a corpus
of T texts, we denote by FC(m) the frequency of occurrence of word m in
corpus T. A list of words, ranked in order of decreasing values of FC(m), is
built; we denote the rank of word m in that list by r(m). Zipf's law states
that FC(m) · r(m) = K, where K is a corpus-dependent quantity. Hence, there
is a very small number of very frequent words, and a large number of very
rare words that occur once or twice in the corpus; between those extremes,
there is a set of words in which discriminant words ought to be sought.
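As a rough illustration, the rank-frequency table behind Zipf's law can be built as follows. This is a sketch under assumptions: FC(m) is taken to be the raw count of word m over the whole corpus, ties in frequency are ranked in order of first appearance, and the function name zipf_table and the toy texts are hypothetical. On a corpus this small the product FC(m) · r(m) is not expected to be constant; the law is a statistical regularity of large corpora.

```python
from collections import Counter

def zipf_table(texts):
    """Return a list of (word, FC, rank, FC * rank), most frequent word first."""
    corpus_counts = Counter()
    for tokens in texts:
        corpus_counts.update(tokens)          # accumulate FC(m) over all texts
    ranked = corpus_counts.most_common()      # sorted by decreasing FC(m)
    return [(word, fc, rank, fc * rank)
            for rank, (word, fc) in enumerate(ranked, start=1)]

texts = [
    "the oil of the falkland islands".split(),
    "the price of oil and the price of gas".split(),
]
table = zipf_table(texts)
for row in table:
    print(row)
```

Plotting log FC(m) against log r(m) for a real corpus is the usual way to check the law: an approximately straight line of slope -1 corresponds to FC(m) · r(m) = K.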
Extraction of the Specific Vocabulary
In order to extract the vocabulary that is specific to the topic, the ratio
R(m, t) = FT(m, t) / FC(m) is computed for each word m of each relevant text
t. The words of the text are ranked in order of decreasing values of R(m, t),
the second half of the list is deleted, and a boolean vector v(t) is defined, such
that v_i(t) = 1 if word i is present in the list, 0 otherwise. Finally, the vector

\[ v = \sum_t v(t) \]

is computed, where the summation is performed over all relevant
documents: the specific vocabulary of the topic is the set of words that have a
nonzero component in vector v. Figure 1.38 shows that Zipf's law is obeyed on
the corpus of Reuters releases, and that the words of the vocabulary specific
to the topic "Falkland petroleum exploration" are indeed located in the middle
of the distribution.
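The extraction procedure can be sketched in Python as follows. Several details are assumptions, not taken from the text: frequencies are raw counts, the relevant texts belong to the corpus (so FC(m) is never zero), a list of odd length keeps its middle element, and ties in R(m, t) keep their order of first appearance. The name specific_vocabulary and the toy corpus are hypothetical.

```python
from collections import Counter

def specific_vocabulary(relevant_texts, corpus_counts):
    """relevant_texts: list of token lists; corpus_counts: mapping word -> FC(word).

    Returns (vocabulary, v), where v[word] counts the relevant texts whose
    shortened list contains the word.
    """
    v = Counter()                                  # v = sum over t of v(t)
    for tokens in relevant_texts:
        text_counts = Counter(tokens)              # FT(m, t)
        ratios = {m: c / corpus_counts[m] for m, c in text_counts.items()}  # R(m, t)
        ranked = sorted(ratios, key=ratios.get, reverse=True)
        kept = ranked[: (len(ranked) + 1) // 2]    # delete the second half of the list
        v.update(kept)                             # adds 1 per kept word: the boolean v(t)
    return {m for m, n in v.items() if n > 0}, v

corpus = [
    "the oil of the falkland islands".split(),          # relevant
    "the price of oil and the price of gas".split(),    # not relevant
    "the falkland exploration licence".split(),         # relevant
]
corpus_counts = Counter(tok for text in corpus for tok in text)
vocab, v = specific_vocabulary([corpus[0], corpus[2]], corpus_counts)
```

In this toy corpus, frequent words such as "the" and "of" have a low ratio R(m, t) and fall in the deleted half of each list, so they do not enter the specific vocabulary.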
Final Selection
Within the specific vocabulary thus defined, which may still be large (one
to several hundred words), a final selection is performed by the probe feature