treats each document as a pool of terms. However, it applies two pre-processing
procedures in order to enhance the quality of the indexing procedure.
Firstly, a number of words, called stop-words, such as prepositions, pronouns and
conjunctions, are commonly found in documents and carry no semantic information
for the comprehension of the document theme (Fox, 1989). Such words are excluded
from further analysis as they do not provide meaningful information and are likely
to distort the similarity measure. We used a list of stop-words provided by Fox (1989).
Secondly, the remaining terms are reduced to their root words through stemming
algorithms. For instance, the terms “usability” and “usable” are reduced to the term
“usabl”, thus allowing the indexing of multiple forms of a word under one dimension
in the vector-space model. We employed Porter's (1980) algorithm for stemming.
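The two pre-processing steps can be sketched as follows. The stop-word list and the suffix-stripping rules below are illustrative stand-ins for the full Fox (1989) list and Porter's (1980) algorithm, chosen only so that "usability" and "usable" both reduce to "usabl" as in the example above:

```python
import re
from collections import Counter

# Illustrative stop-word list; the chapter uses the full list from Fox (1989).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "it", "is"}

def naive_stem(term):
    """Crude suffix stripping as a stand-in for Porter's (1980) algorithm;
    e.g. 'usability' and 'usable' both map to 'usabl'."""
    for suffix in ("ability", "able", "ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 1:
            return term[: -len(suffix)] + ("abl" if suffix in ("ability", "able") else "")
    return term

def index_document(text):
    """Tokenize, drop stop-words, and stem, returning term frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)
```

Applying `index_document` to each document in the collection yields the term pools from which the term-by-document matrix of the next step is built.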
6.2.1.2 Normalizing Impact of Terms
The first step in the procedure has resulted in an n×m matrix A, where each element
a_{i,j} denotes the number of times that the stemmed term i appears in document j. The
frequencies of different terms across different documents vary substantially.
This gives undesirably high weight to terms that are frequent across a large
set of documents, compared with terms that appear in only a small set of
documents. However, terms that appear in many documents have limited
discriminatory power and are thus not very informative. One term-weighting
criterion that counterbalances this inherent bias is the term-frequency
inverse-document-frequency (TFIDF) (Salton and Buckley, 1988):
$$a_{i,j}^{\mathrm{weighted}} = a_{i,j}\,\log\!\left(\frac{nDocs}{nDocs_i}\right) \tag{6.1}$$
which weights the frequency a_{i,j} by the logarithm of the ratio of the total number of
documents nDocs to the number of documents nDocs_i in which term i appears.
Thus, frequent terms that appear in a large number of documents, and therefore have
little discriminatory power, receive lower weight in the final matrix.
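A minimal sketch of the weighting in Eq. (6.1), assuming a natural logarithm (the chapter does not specify the log base, which only rescales the weights):

```python
import math

def tfidf_weight(A):
    """Apply the TFIDF weighting of Eq. (6.1) to a term-by-document count
    matrix A, given as a list of rows (rows = terms, columns = documents).
    Assumes every term occurs in at least one document (nDocs_i >= 1)."""
    n_docs = len(A[0])
    weighted = []
    for row in A:
        n_docs_i = sum(1 for count in row if count > 0)  # documents containing term i
        # A term present in every document gets weight log(nDocs/nDocs) = 0.
        weighted.append([count * math.log(n_docs / n_docs_i) for count in row])
    return weighted
```

For example, a term occurring in both of two documents is weighted to zero, while a term confined to one document keeps a positive weight scaled by log(2).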
6.2.1.3 Dimensionality Reduction
Matrix A is sparse and high-dimensional. Moreover, certain groups of terms may
display similar distributions across the different documents, thus reflecting a single
underlying latent variable. LSA attempts to approximate A by a matrix of lower rank.
Singular Value Decomposition is used to decompose matrix A into three matrices
U, S and V such that

$$A = U S V^{T} \tag{6.2}$$
Matrices U and V are orthonormal and S is a diagonal matrix containing the
singular values of A. The singular values are ordered by decreasing size in S;
thus, by keeping only the first k×k submatrix of S (and the corresponding
columns of U and V), we obtain the best rank-k approximation of A:

$$A_k = U_{n \times k}\, S_{k \times k}\, V_{m \times k}^{T} \tag{6.3}$$
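The truncation in Eq. (6.3) can be sketched with NumPy's SVD routine; the toy matrix below is illustrative, standing in for a weighted term-by-document matrix:

```python
import numpy as np

# Toy 4-term x 3-document weighted matrix (values are illustrative).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])

# Thin SVD: A = U @ diag(s) @ Vt, with singular values s in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation: keep only the k largest singular values
# and the corresponding columns of U and rows of Vt.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Documents are then compared in the k-dimensional latent space rather than in the original sparse term space.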