worked well as stand-ins for the set of causally related sentences. It should be noted
that this approach may not succeed so well with other genres, such as narrative or
history texts.
We tested several systems that combined word matching and LSA2; the best was
LSA2/WB2-TT. These combinatory systems use a weighted sum of the factors from
the fully automated word-based systems and LSA2. The combinations allowed us to
examine the benefit of the world knowledge benchmark (used in LSA1) when LSA
was combined with a fully automated word-based system, and we found that this
benchmark could be dropped. Hence, only three benchmarks are used for the
LSA-based factors: 1) the words
in the title of the passage, 2) the words in the sentence, and 3) the words in the two
immediately prior sentences. From the word-based values we include 4) the number
of content words matched in the target sentence, 5) the number of content words
matched in the prior sentences, 6) the number of content words matched in the
subsequent sentences, and 7) the number of content words that were not matched in
4, 5, or 6. One further adjustment was made because we noticed that the LSA ap-
proach alone was better at predicting higher values correctly, while the word-based
approach was better at predicting lower values. Consequently, if the formula of the
combined system predicted a score of 2 or 3, that value was used. However, if the
system predicted a 1, a formula from the word-based system was applied instead. Finally, level 0
was assigned to explanations that had negligible cosine matches with all three LSA
benchmarks.
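The final level-assignment rule described above can be sketched as follows; the cosine threshold `eps` is a hypothetical stand-in for what the text calls a "negligible" match:

```python
def assign_level(combined_pred, word_pred, benchmark_cosines, eps=0.05):
    """Sketch of the final scoring rule; eps is an illustrative threshold."""
    # Level 0: negligible cosine match with all three LSA benchmarks
    if all(c < eps for c in benchmark_cosines):
        return 0
    # The combined weighted-sum formula is trusted at the high end (2 or 3)
    if combined_pred in (2, 3):
        return combined_pred
    # Otherwise (a predicted 1), fall back to the word-based formula's value
    return word_pred
```

The ordering of the checks matters: the level-0 test runs first, so an explanation with no benchmark overlap is scored 0 regardless of what either formula predicts.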
6.2.3 Topic Models (TM) Feedback System
The Topic Models approach (TM; [10, 27]) applies a probabilistic model to relate
terms and documents through topics. A document is conceived of as having been
generated probabilistically from a number of topics, and each topic consists of a
number of terms, each assigned a probability of being selected when that topic is
used. Using the TM matrix, we can estimate the probability that a certain
topic was used in the creation of a given document. If two documents are similar,
the estimates of the topics they probably contain should be similar. TM is very
similar to LSA, except that a term-document frequency matrix is factored into two
matrices instead of three.
{X_normalized} = {W}{D}        (6.4)
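As a quick dimensional check of Equation 6.4 (with made-up sizes, not the actual corpus), the two-matrix factorization can be sketched in numpy: each column of {W} is a word distribution for one topic, each column of {D} is a topic distribution for one document, and their product is a normalized term-document matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_topics, n_docs = 800, 50, 100  # illustrative sizes only

# {W}: word-given-topic probabilities; one column (distribution) per topic
W = rng.dirichlet(np.ones(n_words), size=n_topics).T   # shape (n_words, n_topics)
# {D}: topic-given-document probabilities; one column per document
D = rng.dirichlet(np.ones(n_topics), size=n_docs).T    # shape (n_topics, n_docs)

# {X_normalized}: each column is a word distribution for one document
X = W @ D                                              # shape (n_words, n_docs)
```

Because every column of {W} and {D} sums to 1, every column of the product also sums to 1, which is what "normalized" means here.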
The dimension of matrix {W} is W x T , where W is the number of words in the
corpus and T is the number of topics. The number of topics varies, roughly, with the
size of the corpus; for example, a corpus of 8,000 documents may require only 50 topics,
while a corpus of 40,000 documents could require about 300 topics. We use the TM
Toolbox [28] to generate the {W } or TM matrix, using the same science corpus as
we used for the LSA matrix. In this construction, the matrix {X} is for all terms in
the corpus, not just those appearing in two different documents. Although matrix
{X} is supposed to be normalized, the TM toolbox handles this normalization
and outputs, for each topic, the topic probability and a list of the topic's terms
along with their probabilities in descending order (shown in Table 6.1). This output
is easily transformed into the term-topic-probability matrix.
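A minimal sketch of that transformation, assuming a simplified stand-in for the toolbox's output (the real file format differs; the topics and probabilities below are invented for illustration):

```python
import numpy as np

# Hypothetical, simplified version of the toolbox output described in the text:
# for each topic, its probability and a descending list of (term, probability) pairs.
topics = {
    0: {"prob": 0.6, "terms": [("cell", 0.10), ("energy", 0.07)]},
    1: {"prob": 0.4, "terms": [("force", 0.12), ("mass", 0.05)]},
}

# Collect the vocabulary and assign each term a row index.
vocab = sorted({term for t in topics.values() for term, _ in t["terms"]})
row = {term: i for i, term in enumerate(vocab)}

# Build the term-topic-probability matrix: rows are terms, columns are topics.
W = np.zeros((len(vocab), len(topics)))
for topic_id, t in topics.items():
    for term, p in t["terms"]:
        W[row[term], topic_id] = p
```

Terms the toolbox does not list for a topic simply keep probability zero in that topic's column.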