Fig. 2. Plate notation of NMSC model
Here $z_{ij}^{\langle k\rangle}$ is the topic of the j-th word in document i at the k-th level, $w_{ij}$
denotes the actual word used, and $\lambda_{i}^{\langle k\rangle}$ is the sparseness regulatory parameter
used to further analyze the i-th topic at the k-th level. Shaded circles
indicate observed variables, while empty circles
indicate latent variables. The directed edges represent dependencies between
the variables.
According to the plate notation, it is clear that any topic in a non-root level
depends not only on the document corpus and the sparseness regulatory parameter
but also on the topics of the level above it.
3 Experiment
In this section, we evaluate the proposed NMSC on the problem of document
representation. We perform document clustering and compare the results with
those of other typical clustering methods.
3.1 Data Corpora
We use the Reuters database as the document corpus. Each document in the
corpus has been manually labeled with one or a few topics, which also indicate
the clusters it belongs to. The Reuters database contains 21578 documents and
135 topics, each of which is treated as a cluster. However, for a fair comparison with other
clustering methods, which assume that each document belongs to only one
cluster, we removed the multi-topic documents and retained 9494 documents
in 51 clusters, each of which contains at least 5 documents.
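
The filtering described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name filter_single_topic, the (doc_id, topics) input format, and the order of the two steps (drop multi-topic documents first, then drop small clusters) are assumptions. Documents with no topic label are also dropped here, which the text does not state explicitly.

```python
from collections import Counter


def filter_single_topic(docs, min_cluster_size=5):
    """Keep single-topic documents whose topic has enough members.

    docs: iterable of (doc_id, topics) pairs, where topics is a list of
          topic labels (hypothetical input format, assumed for this sketch).
    Returns a list of (doc_id, topic) pairs.
    """
    # Step 1: keep only documents labeled with exactly one topic.
    single = [(doc_id, topics[0]) for doc_id, topics in docs if len(topics) == 1]

    # Step 2: count the remaining documents per topic (one topic = one cluster).
    counts = Counter(topic for _, topic in single)

    # Step 3: keep documents whose cluster has at least min_cluster_size members.
    return [(doc_id, topic) for doc_id, topic in single
            if counts[topic] >= min_cluster_size]
```

With min_cluster_size=5, this corresponds to the rule of keeping only clusters that contain at least 5 documents. Whether it reproduces the 9494 documents and 51 clusters reported above depends on how the Reuters labels are parsed, which is not specified here.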