Fig. 1.38. An experimental verification of Zipf's law on the Reuters corpus, and location of the words of the vocabulary specific to the topic "Falkland petroleum exploration"
method, described in Chap. 2. After completion of that step, it turns out that, on average over 500 different topics, the specific vocabulary of a given topic contains about 25 words, which is a reasonable dimension for the input vector of a neural network. That representation, however, is not yet fully satisfactory: since isolated words are ambiguous in such an application, the context must be taken into account.
1.4.5.2 Context Determination
In order to take the context into account in the representation of the texts, context words are sought in a window of five words on either side of each word of the specific vocabulary.
Words that occur in the vicinity of the specific-vocabulary words in relevant texts are defined as positive context words.
Words that occur in the vicinity of the specific-vocabulary words in irrelevant texts are defined as negative context words.
The context words are selected with the same procedure as the specific vocabulary. On average over 500 topics, a topic is defined by 25 specific words, each of which has 3 context words.
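As an illustration, the following Python sketch gathers candidate context words in a window of five words on either side of each specific-vocabulary word, keeping separate counts for relevant and irrelevant texts. The function names, the tokenized-text inputs, and the Counter-based bookkeeping are assumptions made for this example; the actual selection procedure applied to these counts is the one described for the specific vocabulary.

```python
from collections import Counter

def context_counts(tokens, specific_vocab, window=5):
    """Count words appearing within `window` positions of any
    specific-vocabulary word in one tokenized text (hypothetical helper)."""
    counts = Counter()
    for i, word in enumerate(tokens):
        if word in specific_vocab:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] not in specific_vocab:
                    counts[tokens[j]] += 1
    return counts

def collect_context_candidates(texts, labels, specific_vocab, window=5):
    """Accumulate candidate positive context words (from relevant texts)
    and negative context words (from irrelevant texts)."""
    positive, negative = Counter(), Counter()
    for tokens, relevant in zip(texts, labels):
        counts = context_counts(tokens, specific_vocab, window)
        (positive if relevant else negative).update(counts)
    return positive, negative
```

The same word-selection step used for the specific vocabulary would then be applied to these two sets of counts to retain, on average, three context words per specific word.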
1.4.5.3 Filter Design and Training
Filters Without Context
If the context is not taken into account, the inputs of the filter are the words of the specific vocabulary, encoded as indicated above. In accordance with the classifier design methodology described above, the structure of the classifier depends on the complexity of the discrimination problem. On the corpora tested during the development of the present application, the examples are linearly separable, so that networks made of a single neuron with a sigmoid activation function solve the problem.
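Since a single neuron with a sigmoid activation function is simply a logistic-regression unit, a context-free filter can be sketched as below. The encoding of the input vector, the cross-entropy cost, the gradient-descent training loop, and all function names are assumptions made for illustration, not the training procedure used in the application itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_single_neuron(X, y, lr=0.1, epochs=500):
    """Train one sigmoidal neuron by gradient descent on the mean
    cross-entropy cost.  X: (n_texts, n_vocab) encoded specific-vocabulary
    inputs; y: 0/1 relevance labels."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)          # predicted relevance probability
        grad_w = X.T @ (p - y) / n      # gradient with respect to the weights
        grad_b = np.mean(p - y)         # gradient with respect to the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b, threshold=0.5):
    """Classify texts as relevant (1) or irrelevant (0)."""
    return (sigmoid(X @ w + b) >= threshold).astype(int)
```

For a linearly separable topic, such a single neuron is sufficient to separate relevant from irrelevant texts once trained.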