variables with the knowledge from the previous step. According to the characteristics of these two steps, we define two likelihood functions and use the EM algorithm to find a locally optimal solution with maximum likelihood. On the one hand, this approach avoids the blind search of the solution space typical of unsupervised learning; on the other hand, it requires only some class variables rather than a large number of labeled training samples. It thus frees website managers from the tedious labeling of training documents and improves the efficiency of automatic web page classification. To distinguish it from supervised and unsupervised learning, this approach is called semi-supervised learning.
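To make the procedure concrete, here is a minimal sketch (not the book's implementation) of semi-supervised EM for a Gaussian naive Bayes classifier: labeled documents keep fixed one-hot class assignments, while the E-step re-estimates soft assignments only for the unlabeled ones. All function and variable names are illustrative.

```python
import numpy as np

def semi_supervised_em(X_lab, y_lab, X_unlab, n_classes, n_iter=50):
    """EM for a Gaussian naive Bayes model on partially labeled data.

    Labeled points keep fixed one-hot responsibilities; the E-step
    only re-estimates soft class assignments for the unlabeled points.
    """
    X = np.vstack([X_lab, X_unlab])
    n_lab = len(X_lab)
    # Responsibilities: one-hot for labeled rows, uniform init for unlabeled.
    R = np.full((len(X), n_classes), 1.0 / n_classes)
    R[:n_lab] = 0.0
    R[np.arange(n_lab), y_lab] = 1.0

    for _ in range(n_iter):
        # M-step: class priors, per-class feature means and variances.
        Nk = R.sum(axis=0)
        priors = Nk / Nk.sum()
        means = (R.T @ X) / Nk[:, None]
        var = (R.T @ (X ** 2)) / Nk[:, None] - means ** 2 + 1e-6

        # E-step: posterior class probabilities under the Gaussian model.
        log_lik = np.zeros((len(X), n_classes))
        for c in range(n_classes):
            log_lik[:, c] = (np.log(priors[c])
                             - 0.5 * np.sum(np.log(2 * np.pi * var[c])
                                            + (X - means[c]) ** 2 / var[c],
                                            axis=1))
        post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        R[n_lab:] = post[n_lab:]          # labeled rows stay fixed

    return priors, means, var, R

# Tiny demo: one labeled document per class, many unlabeled ones.
rng = np.random.default_rng(0)
X_lab = np.array([[0., 0.], [4., 4.]])
y_lab = np.array([0, 1])
X_unlab = np.vstack([rng.normal(0, 0.5, (20, 2)),
                     rng.normal(4, 0.5, (20, 2))])
priors, means, var, R = semi_supervised_em(X_lab, y_lab, X_unlab, n_classes=2)
```

The key design point is that the two kinds of responsibilities mirror the two likelihood functions in the text: labeled data contribute hard class assignments, unlabeled data contribute expected ones.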
The basic idea of latent semantic analysis (LSA) is to project the documents in
high dimensional vector space model (VSM) to a low dimensional latent
semantic space. This projection is performed via singular value decomposition
(SVD) of the term/document matrix $N_{m \times n}$. Concretely, according to linear algebra, any matrix $N_{m \times n}$ can be decomposed as follows:

$N = U \Sigma V^{T}$    (6.42)

where $U$ and $V$ are orthogonal matrices ($UU^{T} = VV^{T} = I$) and $\Sigma = \mathrm{diag}(a_{1}, a_{2}, \ldots, a_{k}, \ldots, a_{v})$ ($a_{1}, a_{2}, \ldots, a_{v}$ are the singular values) is a diagonal matrix. In latent semantic analysis, an approximation is obtained by keeping the $k$ biggest singular values and setting the others to 0:

$\tilde{N} = U \tilde{\Sigma} V^{T}$    (6.43)
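A minimal numpy sketch of Eqs. (6.42) and (6.43), assuming a toy count matrix with documents as rows (so that $U\tilde{\Sigma}$ later gives document coordinates, matching the text's use of $\tilde{N}\tilde{N}^{T}$ for document similarity); `numpy.linalg.svd` returns the singular values in descending order, so the rank-$k$ approximation simply zeroes the trailing ones.

```python
import numpy as np

# Toy document-by-term count matrix N (documents as rows).
# Terms:        car  auto truck flower
N = np.array([[1.,  1.,  0.,  0.],    # doc 0
              [0.,  1.,  1.,  0.],    # doc 1
              [0.,  0.,  1.,  1.]])   # doc 2

# Full SVD: N = U @ diag(s) @ Vt                        (Eq. 6.42)
U, s, Vt = np.linalg.svd(N, full_matrices=False)

# Keep the k biggest singular values, set the rest to 0 (Eq. 6.43)
k = 2
s_k = np.concatenate([s[:k], np.zeros(len(s) - k)])
N_tilde = U @ np.diag(s_k) @ Vt

print(np.round(N_tilde, 2))   # rank-k approximation of N
```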
Because the similarity between two documents can be represented by $\tilde{N}\tilde{N}^{T} = U\tilde{\Sigma}V^{T}(U\tilde{\Sigma}V^{T})^{T} = U\tilde{\Sigma}^{2}U^{T} = (U\tilde{\Sigma})(U\tilde{\Sigma})^{T}$, the coordinates of a document in the latent semantic space can be approximated by the corresponding row of $U\tilde{\Sigma}$. After projecting the representation of a document from the high dimensional space to the low dimensional semantic space, the sparsity of the data that exists in the high dimensional space disappears in the low dimensional latent semantic space. This also means that even if two documents have no common term in the high dimensional space, we may still find meaningful connections between them in the low dimensional semantic space.
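Continuing the toy example above (names illustrative), the rows of $U\tilde{\Sigma}$ serve as document coordinates, and cosine similarity in the $k$-dimensional latent space can reveal exactly such connections: documents 0 and 2 share no terms, yet typically obtain a nonzero latent-space similarity.

```python
import numpy as np

# Toy document-by-term matrix from the previous sketch.
N = np.array([[1., 1., 0., 0.],   # doc 0: car, auto
              [0., 1., 1., 0.],   # doc 1: auto, truck
              [0., 0., 1., 1.]])  # doc 2: truck, flower

U, s, Vt = np.linalg.svd(N, full_matrices=False)
k = 2

# Rows of U @ diag(s_k) are the k-dimensional document coordinates.
coords = U[:, :k] * s[:k]

# Cosine similarity between all document pairs in the latent space.
# Docs 0 and 2 have zero similarity in the original space, but are
# generally linked here through their co-occurrences with doc 1.
unit = coords / np.linalg.norm(coords, axis=1, keepdims=True)
print(np.round(unit @ unit.T, 2))
```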
After the SVD and the projection of documents from the high dimensional space to the low dimensional latent semantic space, the scale of the problem is effectively reduced. LSA has been successfully applied in many fields, including information filtering, text indexing and video retrieval. However, SVD is sensitive to variation in the data and is inflexible when prior information is lacking. These shortcomings limit its application.
In our experience, the description of any problem develops around certain themes, and there are relatively obvious boundaries between different themes. Because of differences in personal preferences and interests, people's concerns about different themes vary. There is prior knowledge in different