Table 6.1. Results from Topic Models Toolbox: science corpus, 50 topics, seed 1,
500 iterations, default alpha and beta.
TOPIC 2          0.0201963151    TOPIC 38         0.0214418635
earth            0.1373291184    light            0.1238061875
sun              0.0883152826    red              0.0339683946
solar            0.0454833721    color            0.0307797075
atmosphere       0.0418036547    white            0.0262046347
moon             0.0362104843    green            0.0230159476
surface          0.0181062747    radiation        0.0230159476
planet           0.0166343877    wavelengths      0.0230159476
center           0.0148681234    blue             0.0184408748
bodies           0.0147209347    dark             0.0178863206
tides            0.0139849912    visible          0.0170544891
planets          0.0133962364    spectrum         0.0151135492
gravitational    0.0125131042    absorbed         0.0149749106
system           0.0111884060    colors           0.0148362720
appear           0.0110412173    rays             0.0116475849
mass             0.0100108964    eyes             0.0108157535
core             0.0083918207    yellow           0.0105384764
space            0.0083918207    absorption       0.0102611992
times            0.0079502547    eye              0.0095680064
orbit            0.0073614999    pigment          0.0092907293
...                              ...
To measure the similarity between documents based on TM, the Kullback-Leibler
distance (KL-distance: [27]) between two documents is recommended, rather than
the cosine (which, nevertheless, can be used). A document can be represented by a
set of probabilities that the document contains each topic t, using the following
equation:
$$D_t = \sum_{i=1}^{n} T_{it} \tag{6.5}$$
where $D_t$ is the probability of topic $t$ in the document $D$, $T_{it}$ is the
probability of topic $t$ for term $i$ in the document $D$, and $n$ is the number
of terms appearing in the document $D$. The KL-distance between two documents
(the similarity measure; smaller values indicate more similar documents) is
computed as follows:
$$KL(D_1, D_2) = \frac{1}{2} \sum_{t=1}^{T} D_{1t} \log_2\left(\frac{D_{1t}}{D_{2t}}\right) + \frac{1}{2} \sum_{t=1}^{T} D_{2t} \log_2\left(\frac{D_{2t}}{D_{1t}}\right) \tag{6.6}$$
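Equations 6.5 and 6.6 can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the evaluation: the function names, the toy probabilities, the optional normalization of the summed vector, and the small `eps` guard against zero probabilities are all assumptions introduced here.

```python
import math


def document_topic_vector(term_topic_probs, normalize=True):
    """Sum the per-term topic probabilities T_it over a document's terms (Eq. 6.5).

    term_topic_probs holds one row per term in the document; each row holds
    one probability per topic.
    """
    num_topics = len(term_topic_probs[0])
    d = [sum(row[t] for row in term_topic_probs) for t in range(num_topics)]
    if normalize:
        # Assumption (not stated in the text): rescale so the entries sum to 1,
        # which makes the vector a proper distribution before comparing documents.
        total = sum(d)
        d = [x / total for x in d]
    return d


def symmetric_kl(d1, d2, eps=1e-12):
    """Symmetrized KL-distance with base-2 logarithms (Eq. 6.6)."""
    total = 0.0
    for p, q in zip(d1, d2):
        # Assumption: clamp to eps so log(0) and division by zero cannot occur.
        p, q = max(p, eps), max(q, eps)
        total += 0.5 * p * math.log2(p / q) + 0.5 * q * math.log2(q / p)
    return total


# Toy example: two documents with three terms each, over three topics.
doc_a = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
doc_b = [[0.1, 0.2, 0.7], [0.2, 0.2, 0.6], [0.1, 0.1, 0.8]]

vec_a = document_topic_vector(doc_a)
vec_b = document_topic_vector(doc_b)
print(symmetric_kl(vec_a, vec_b))  # larger value: the documents differ in topic
print(symmetric_kl(vec_a, vec_a))  # identical documents give a distance of 0
```

Because both directions of the divergence are averaged, the measure is symmetric in its two arguments, unlike the plain Kullback-Leibler divergence.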
Constructing a TM matrix involves making choices regarding a number of factors,
such as the number of topics, the seed for random number generation, alpha,
beta, and the number of iterations. We have explored these factors and constructed
a number of TM matrices in an effort to optimize the resulting matrix; however, for
this preliminary evaluation, we use a TM matrix of 50 topics and a seed of 1.
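To make the role of each factor concrete, the following is a minimal sketch of a generic collapsed Gibbs sampler for a topic model, written in plain Python. It is not the Topic Models Toolbox implementation; the function name `gibbs_lda`, the toy corpus, and the alpha and beta values in the usage lines are hypothetical and chosen only for illustration.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha, beta, iterations, seed):
    """Collapsed Gibbs sampling on docs given as lists of integer word ids.

    Returns one row per topic holding smoothed word probabilities.
    """
    rng = random.Random(seed)  # the seed fixes the random initialization and sampling
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Assign every token to a random topic and record the counts.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token;
                # alpha smooths document-topic counts, beta smooths topic-word counts.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return [
        [
            (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
            for w in range(vocab_size)
        ]
        for t in range(num_topics)
    ]


# Toy corpus of word ids over a 4-word vocabulary (hypothetical data).
toy_docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 1, 0, 2]]
topics = gibbs_lda(toy_docs, num_topics=2, vocab_size=4,
                   alpha=0.1, beta=0.01, iterations=50, seed=1)
for row in topics:
    print([round(p, 3) for p in row])
```

Rerunning with the same seed reproduces the same matrix, while changing the seed, the number of topics, or the number of iterations can yield a different one, which is why these factors have to be fixed when reporting results.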
The first TM-based system we tried was simply used in place of the LSA-based
factors in the combined-system. The three benchmarks are still the same but sim-
 