length. The overall quality of each self-explanation (1, 2, or 3) is still computed with
the same formula used in WB-TT.
6.2.2 Latent Semantic Analysis (LSA) Feedback Systems
Latent Semantic Analysis (LSA; [13, 14]) uses statistical computations to extract and represent the meaning of words. Meanings are represented in terms of their similarity to other words in a large corpus of documents. LSA begins by finding the frequency of terms and the number of their co-occurrences in each document throughout the corpus, and then uses a powerful mathematical transformation to find deeper meanings and relations among words. When measuring the similarity between text objects, LSA's accuracy improves with the size of those objects; hence, LSA provides the most benefit when comparing larger units of text, such as full documents. The method, unfortunately, does not take word order into account, so very short documents may not receive the full benefit of LSA.
To construct an LSA corpus matrix, a collection of documents is selected. A document may be a sentence, a paragraph, or a larger unit of text. A term-document-frequency (TDF) matrix X is created for those terms that appear in two or more documents. The row entities correspond to the words or terms (hence the W) and the column entities correspond to the documents (hence the D). The matrix is then analyzed using Singular Value Decomposition (SVD; [26]); that is, the TDF matrix X is decomposed into the product of three other matrices: (1) a matrix W of vectors of derived orthogonal factor values of the original row entities, (2) a matrix D of vectors of derived orthogonal factor values of the original column entities, and (3) a diagonal matrix S of scaling values. The product of these three matrices reproduces the original TDF matrix:
\[ X = W S D^{T} \qquad (6.1) \]
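As a concrete illustration of this construction, the following sketch (in Python with NumPy; the toy corpus and all variable names are our own, not from iSTART) builds a small TDF matrix and verifies the decomposition. Note that NumPy's SVD returns the document factor matrix already transposed:

```python
import numpy as np

# Toy corpus: each "document" is a single sentence (hypothetical example).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a dog and a cat sat together",
]
tokenized = [d.split() for d in docs]

# Keep only terms that appear in two or more documents.
vocab = sorted({t for toks in tokenized for t in toks
                if sum(t in other for other in tokenized) >= 2})

# Term-document-frequency matrix X: rows = terms (W), columns = documents (D).
X = np.array([[toks.count(term) for toks in tokenized] for term in vocab],
             dtype=float)

# SVD: NumPy returns the document factor matrix already transposed (Dt).
W, S, Dt = np.linalg.svd(X, full_matrices=False)

# The product of the three matrices recovers the original TDF matrix (Eq. 6.1).
assert np.allclose(X, W @ np.diag(S) @ Dt)
```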
The dimension $d$ of $S$ significantly affects the effectiveness of the LSA space for any particular application. There is no definite formula for finding the optimal number of dimensions; the dimensionality can be determined empirically by using the matrix $WS$ to measure the similarity of previously evaluated document pairs at several different dimensionalities of $S$ and sampling the results. The optimal size is usually in the range of 300-400 dimensions.
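Continuing the sketch above, truncating the SVD to $d$ dimensions is a simple slicing operation (the value d = 2 here is arbitrary for the toy corpus; as noted, real LSA spaces typically retain 300-400 dimensions):

```python
d = 2  # retained dimensions; tuned empirically in a real application

# Keep the d largest singular values and the matching factor vectors.
W_d, S_d = W[:, :d], S[:d]

# Each row of W_d * S_d (the matrix WS) is a reduced-dimensional term vector.
term_vectors = W_d * S_d  # broadcasting scales column j by S_d[j]
```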
The similarity of two terms is computed by taking the cosine of the corresponding term vectors, where a term vector is the row for that term in the matrix $W$. In iSTART, the documents are sentences from texts and trainees' explanations of those sentences. These documents consist of terms, each represented by a term vector; hence, a document can be represented as a document vector, computed as the sum of the term vectors of its terms:
\[ D_i = \sum_{t=1}^{n} T_{ti} \qquad (6.2) \]
where $D_i$ is the vector for the $i$th document, $T_{ti}$ is the term vector for term $t$ in $D_i$, and $n$ is the number of terms in $D_i$. The similarity between two documents is then computed as the cosine between the two document vectors.
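To make Eq. 6.2 and the cosine comparison concrete, here is a minimal continuation of the same sketch (the helper names doc_vector and cosine are ours, not iSTART's):

```python
def doc_vector(tokens, vocab, term_vectors):
    """Sum the term vectors of a document's terms (Eq. 6.2)."""
    index = {term: i for i, term in enumerate(vocab)}
    return sum(term_vectors[index[t]] for t in tokens if t in index)

def cosine(u, v):
    """Cosine between two document vectors, i.e., their similarity."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Compare a (toy) source sentence with a (toy) trainee explanation.
d1 = doc_vector("the cat sat on the mat".split(), vocab, term_vectors)
d2 = doc_vector("a cat sat together with a dog".split(), vocab, term_vectors)
print(cosine(d1, d2))  # in [-1, 1]; higher means more semantically similar
```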