Neural Networks: An Overview - Neural Networks: Methodology and Applications


text, or of its frequency in the text. Clearly, the main difficulty is the dimension of that vector, which is, in principle, equal to the number of words in the vocabulary. Nevertheless, not all words are equally discriminant: the most frequent words (of, the, and) are not useful for discrimination, nor are very rare words. Therefore, the first step in the design of a filter is to determine the vocabulary that is specific to the topic.

Word Encoding

The words are encoded in the following way: we denote by $FT(m,t)$ the frequency of occurrence of word $m$ in text $t$, and by $FT(t)$ the average frequency of the terms in text $t$. Then the word $m$ is described by [Singhal 1996]

$$x(m) = \frac{1 + \log(FT(m,t))}{1 + \log(FT(t))}.$$
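The encoding above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frequency of a word is taken to be its raw count in the text, the average is taken over the distinct terms of the text, the logarithm is natural, and the function name `encode_words` is hypothetical.

```python
import math
from collections import Counter

def encode_words(tokens):
    """Return {word: x(m)} for one text given as a list of tokens.

    Assumptions (not stated in the text): FT(m, t) is a raw count,
    FT(t) is the mean count over the distinct terms of the text,
    and log is the natural logarithm.
    """
    counts = Counter(tokens)                 # FT(m, t) for every word m
    if not counts:
        return {}                            # empty text: nothing to encode
    mean_count = sum(counts.values()) / len(counts)   # FT(t)
    denom = 1.0 + math.log(mean_count)       # mean_count >= 1, so denom >= 1
    return {word: (1.0 + math.log(c)) / denom for word, c in counts.items()}

tokens = "oil rig oil field oil rig".split()
weights = encode_words(tokens)
for word, x in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {x:.3f}")
```

With this normalization, a word whose count equals the average count of the text gets the value 1, more frequent words get a value above 1, and rarer words a value below 1.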

Zipf's Law

Zipf's law [Zipf 1949] is helpful for finding discriminant words: given a corpus of $T$ texts, we denote by $FC(m)$ the frequency of occurrence of word $m$ in corpus $T$. A list of words, ranked in order of decreasing values of $FC(m)$, is built; we denote the rank of word $m$ in that list by $r(m)$. Zipf's law states that $FC(m)\, r(m) = K$, where $K$ is a corpus-dependent quantity. Hence, there is a very small number of very frequent words, and a large number of very rare words that occur once or twice in the corpus; between those extremes, there is a set of words in which discriminant words ought to be sought.
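The ranking described above can be illustrated with a short Python sketch. It assumes that the corpus frequency is a raw count, that texts are split on whitespace, and that ties in frequency are broken alphabetically; the function name `zipf_table` and the toy corpus are hypothetical. On a corpus this small the product of frequency and rank is not expected to be constant; the sketch only shows how the quantities are computed.

```python
from collections import Counter

def zipf_table(texts):
    """Return rows (word, FC, rank, FC * rank), most frequent word first.

    Assumptions: FC(m) is a raw count over the whole corpus, and words
    with equal counts are ordered alphabetically to make ranks unique.
    """
    corpus_counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(corpus_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, fc, rank, fc * rank)
            for rank, (word, fc) in enumerate(ranked, start=1)]

texts = ["the oil the rig the", "the oil field", "oil rig"]
table = zipf_table(texts)
for word, fc, rank, product in table:
    print(f"{word:6s} FC={fc} r={rank} FC*r={product}")
```

On a real corpus, plotting the logarithm of the frequency against the logarithm of the rank should give points close to a straight line of slope -1 if the law holds.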

Extraction of the Specific Vocabulary

In order to extract the vocabulary that is specific to the topic, the ratio $R(m,t) = FT(m,t)/FC(m)$ is computed for each word $m$ of each relevant text $t$. The words of the text are ranked in order of decreasing values of $R(m,t)$, the second half of the list is deleted, and a boolean vector $v(t)$ is defined, such that $v_i(t) = 1$ if word $i$ is present in the list, 0 otherwise. Finally, the vector $v = \sum_t v(t)$ is computed, where the summation is performed on all relevant documents: the specific vocabulary of the topic is the set of words that have a nonzero component in vector $v$. Figure 1.38 shows that Zipf's law is obeyed on the corpus of Reuters releases, and that the words of the vocabulary specific to the topic "Falkland petroleum exploration" are indeed located in the middle of the distribution.
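The extraction procedure above might be sketched as follows. Several details are assumptions made for illustration: frequencies are raw counts, the relevant texts belong to the corpus (so that every corpus count is nonzero), ties in the ratio are broken alphabetically, and for a list of odd length the kept half is rounded up. The function name `specific_vocabulary` and the toy corpus are hypothetical.

```python
from collections import Counter

def specific_vocabulary(relevant_texts, corpus_texts):
    """Return the vector v as a Counter: v[word] = number of relevant
    texts in which the word survives the "keep the first half" step.

    Assumption: every relevant text is also part of corpus_texts,
    so FC(m) > 0 for each word that is looked up.
    """
    corpus_counts = Counter(w for t in corpus_texts for w in t.split())  # FC(m)
    v = Counter()
    for text in relevant_texts:
        text_counts = Counter(text.split())                              # FT(m, t)
        ratios = {w: c / corpus_counts[w] for w, c in text_counts.items()}  # R(m, t)
        ranked = sorted(ratios, key=lambda w: (-ratios[w], w))
        kept = ranked[: (len(ranked) + 1) // 2]   # delete the second half of the list
        v.update(kept)                            # add the boolean vector v(t)
    return v

corpus = [
    "the oil rig drills the seabed",
    "the bank raised the rate",
    "the falkland oil licence",
    "the rate fell",
]
relevant = [corpus[0], corpus[2]]
v = specific_vocabulary(relevant, corpus)
vocabulary = set(v)   # words with a nonzero component in v
print(sorted(vocabulary))
```

In this toy corpus the very frequent word "the" has a low ratio in every relevant text and is therefore dropped, which is the behavior the procedure is meant to produce.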

Final Selection

Within the specific vocabulary thus defined, which may still be large (one to several hundred words), a final selection is performed by the probe feature
