worked well as stand-ins for the set of causally related sentences. It should be noted
that this approach may not succeed so well with other genres, such as narrative or
history texts.
We tested several systems that combined word matching and LSA2; the best was
LSA2/WB2-TT. These combinatory systems use a weighted sum of the factors from
the fully automated word-based systems and LSA2. The combinations allowed us to
examine the benefit of the world knowledge benchmark (used in LSA1) when LSA
was combined with a fully automated word-based system, and we found that this
benchmark could be dropped. Hence, only three benchmarks are used for the
LSA-based factors: 1) the words
in the title of the passage, 2) the words in the sentence, and 3) the words in the two
immediately prior sentences. From the word-based values we include 4) the number
of content words matched in the target sentence, 5) the number of content words
matched in the prior sentences, 6) the number of content words matched in the
subsequent sentences, and 7) the number of content words that were not matched in
4, 5, or 6. One further adjustment was made because we noticed that the LSA ap-
proach alone was better at predicting higher values correctly, while the word-based
approach was better at predicting lower values. Consequently, if the formula of the
combined system predicted a score of 2 or 3, that value was used. However, if the
system predicted a 1, a formula from the word-based system was applied instead. Finally, level 0
was assigned to explanations that had negligible cosine matches with all three LSA
benchmarks.
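The final level-assignment rule described above can be sketched as follows; the cosine threshold `eps` is a hypothetical stand-in for what the text calls a "negligible" match:

```python
def assign_level(combined_pred, word_pred, benchmark_cosines, eps=0.05):
    """Sketch of the final scoring rule; eps is an illustrative threshold."""
    # Level 0: negligible cosine match with all three LSA benchmarks
    if all(c < eps for c in benchmark_cosines):
        return 0
    # The combined weighted-sum formula is trusted at the high end (2 or 3)
    if combined_pred in (2, 3):
        return combined_pred
    # Otherwise (a predicted 1), fall back to the word-based formula's value
    return word_pred
```

The ordering of the checks matters: the level-0 test runs first, so an explanation with no benchmark overlap is scored 0 regardless of what either formula predicts.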
6.2.3 Topic Models (TM) Feedback System
The Topic Models approach (TM; [10, 27]) applies a probabilistic model to relate
terms and documents through topics. A document is conceived of as having been
generated probabilistically from a number of topics, and each topic consists of a
number of terms, each assigned a probability of being selected when that topic is
used. Using the TM matrix, we can estimate the probability that a certain
topic was used in the creation of a given document. If two documents are similar,
the estimates of the topics they probably contain should be similar. TM is very
similar to LSA, except that a term-document frequency matrix is factored into two
matrices instead of three.
{X_normalized} = {W}{D}        (6.4)
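As a quick dimensional check of Equation 6.4 (with made-up sizes, not the actual corpus), the two-matrix factorization can be sketched in numpy: each column of {W} is a word distribution for one topic, each column of {D} is a topic distribution for one document, and their product is a normalized term-document matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_topics, n_docs = 800, 50, 100  # illustrative sizes only

# {W}: word-given-topic probabilities; one column (distribution) per topic
W = rng.dirichlet(np.ones(n_words), size=n_topics).T   # shape (n_words, n_topics)
# {D}: topic-given-document probabilities; one column per document
D = rng.dirichlet(np.ones(n_topics), size=n_docs).T    # shape (n_topics, n_docs)

# {X_normalized}: each column is a word distribution for one document
X = W @ D                                              # shape (n_words, n_docs)
```

Because every column of {W} and {D} sums to 1, every column of the product also sums to 1, which is what "normalized" means here.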
The dimension of matrix {W} is W x T , where W is the number of words in the
corpus and T is the number of topics. The number of topics varies, roughly, with the
size of the corpus; for example, a corpus of 8,000 documents may require only 50 topics,
while a corpus of 40,000 documents could require about 300 topics. We use the TM
Toolbox [28] to generate the {W } or TM matrix, using the same science corpus as
we used for the LSA matrix. In this construction, the matrix {X} is for all terms in
the corpus, not just those appearing in two different documents. Although matrix
{X} is supposed to be normalized, the TM toolbox handles this normalization
and outputs, for each topic, the topic probability and a list of the topic's terms
along with their probabilities in descending order (shown in Table 6.1). This output
is easily transformed into the term-topic-probability matrix.
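A minimal sketch of that transformation, assuming a simplified stand-in for the toolbox's output (the real file format differs; the topics and probabilities below are invented for illustration):

```python
import numpy as np

# Hypothetical, simplified version of the toolbox output described in the text:
# for each topic, its probability and a descending list of (term, probability) pairs.
topics = {
    0: {"prob": 0.6, "terms": [("cell", 0.10), ("energy", 0.07)]},
    1: {"prob": 0.4, "terms": [("force", 0.12), ("mass", 0.05)]},
}

# Collect the vocabulary and assign each term a row index.
vocab = sorted({term for t in topics.values() for term, _ in t["terms"]})
row = {term: i for i, term in enumerate(vocab)}

# Build the term-topic-probability matrix: rows are terms, columns are topics.
W = np.zeros((len(vocab), len(topics)))
for topic_id, t in topics.items():
    for term, p in t["terms"]:
        W[row[term], topic_id] = p
```

Terms the toolbox does not list for a topic simply keep probability zero in that topic's column.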