length. The overall quality of each self-explanation (1, 2, or 3) is still computed with
the same formula used in WB-TT.
6.2.2 Latent Semantic Analysis (LSA) Feedback Systems
Latent Semantic Analysis (LSA; [13, 14]) uses statistical computations to extract and represent the meaning of words. Meanings are represented in terms of their similarity to other words in a large corpus of documents. LSA begins by finding the frequency of terms and the number of their co-occurrences in each document throughout the corpus, and then uses a powerful mathematical transformation to find deeper meanings and relations among words. When measuring the similarity between text objects, LSA's accuracy improves with the size of those objects; hence, LSA provides the most benefit when comparing larger units of text, such as full documents. The method, unfortunately, does not take word order into account, so very short documents may not receive the full benefit of LSA.
To construct an LSA corpus matrix, a collection of documents is selected. A document may be a sentence, a paragraph, or a larger unit of text. A term-document-frequency (TDF) matrix X is created for those terms that appear in two or more documents. The row entities correspond to the words or terms (hence the W) and the column entities correspond to the documents (hence the D). The matrix is then analyzed using Singular Value Decomposition (SVD; [26]); that is, the TDF matrix X is decomposed into the product of three other matrices: (1) a matrix W of vectors of derived orthogonal factor values of the original row entities, (2) a matrix D of vectors of derived orthogonal factor values of the original column entities, and (3) a diagonal matrix S of scaling values. The product of these three matrices reproduces the original TDF matrix:
\[ X = W S D^{T} \qquad (6.1) \]
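As a concrete illustration of this construction, the following sketch (in Python with NumPy; the toy corpus and all variable names are our own, not from iSTART) builds a small TDF matrix and verifies the decomposition. Note that NumPy's SVD returns the document factor matrix already transposed:

```python
import numpy as np

# Toy corpus: each "document" is a single sentence (hypothetical example).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a dog and a cat sat together",
]
tokenized = [d.split() for d in docs]

# Keep only terms that appear in two or more documents.
vocab = sorted({t for toks in tokenized for t in toks
                if sum(t in other for other in tokenized) >= 2})

# Term-document-frequency matrix X: rows = terms (W), columns = documents (D).
X = np.array([[toks.count(term) for toks in tokenized] for term in vocab],
             dtype=float)

# SVD: NumPy returns the document factor matrix already transposed (Dt).
W, S, Dt = np.linalg.svd(X, full_matrices=False)

# The product of the three matrices recovers the original TDF matrix (Eq. 6.1).
assert np.allclose(X, W @ np.diag(S) @ Dt)
```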
The dimension $d$ of $S$ significantly affects the effectiveness of the LSA space for any particular application. There is no definite formula for finding the optimal number of dimensions; the dimensionality can be determined empirically by using the matrix $WS$ to measure the similarity of previously evaluated document pairs at several different dimensionalities of $S$ and sampling the results. The optimal size is usually in the range of 300-400 dimensions.
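Continuing the sketch above, truncating the SVD to $d$ dimensions is a simple slicing operation (the value d = 2 here is arbitrary for the toy corpus; as noted, real LSA spaces typically retain 300-400 dimensions):

```python
d = 2  # retained dimensions; tuned empirically in a real application

# Keep the d largest singular values and the matching factor vectors.
W_d, S_d = W[:, :d], S[:d]

# Each row of W_d * S_d (the matrix WS) is a reduced-dimensional term vector.
term_vectors = W_d * S_d  # broadcasting scales column j by S_d[j]
```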
The similarity of two terms is computed by taking the cosine of the corresponding term vectors, where a term vector is the row for that term in the matrix $W$. In iSTART, the documents are sentences from texts and trainees' explanations of those sentences. These documents consist of terms, each represented by a term vector; hence, a document can be represented as a document vector, computed as the sum of the term vectors of its terms:
\[ D_i = \sum_{t=1}^{n} T_{ti} \qquad (6.2) \]
where $D_i$ is the vector for the $i$th document, $T_{ti}$ is the term vector for term $t$ in $D_i$, and $n$ is the number of terms in $D_i$. The similarity between two documents is then computed as the cosine between the two document vectors.
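To make Eq. 6.2 and the cosine comparison concrete, here is a minimal continuation of the same sketch (the helper names doc_vector and cosine are ours, not iSTART's):

```python
def doc_vector(tokens, vocab, term_vectors):
    """Sum the term vectors of a document's terms (Eq. 6.2)."""
    index = {term: i for i, term in enumerate(vocab)}
    return sum(term_vectors[index[t]] for t in tokens if t in index)

def cosine(u, v):
    """Cosine between two document vectors, i.e., their similarity."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Compare a (toy) source sentence with a (toy) trainee explanation.
d1 = doc_vector("the cat sat on the mat".split(), vocab, term_vectors)
d2 = doc_vector("a cat sat together with a dog".split(), vocab, term_vectors)
print(cosine(d1, d2))  # in [-1, 1]; higher means more semantically similar
```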