Table 1. Number of documents classified in each topic

Topic                         Number of documents
Achievement/career path        72
Announcement/request          275
Expense/benefits               85
Healthcare                    351
School activities             201
Notification/certificate       87
Disaster/crime-prevention      51
Schooling system               77
Miscellaneous translations     73
case, the candidates in the list include only generic terms such as kenkou,
written in phonetic characters and in kanzi characters as the first and the
second candidates, respectively, in Figure 4(b). Differences in the variety and
ranking of suggested terms come from the domain-dependent term-weighting scheme
behind the auto-suggest interface, which is explained below.
3.2 Domain Knowledge
The domain knowledge used for the auto-suggest feature has been developed
as part of the MUSE project, and consists of a subject classification scheme
for school documents and vocabularies for the school-education domain. The
school documents in the portal site are categorized by the scheme and displayed
in the topic facet shown in the upper left of Figure 2. The classification
scheme for school documents was created by means of card sorting and
hierarchical clustering methods, and consists of nine top-level topics [3].
Table 1 shows the top-level topics and the number of documents classified into
each topic in the portal site. Note that some of the topic names are composed
of two terms (e.g., Announcement/request). These topics were aggregated under a
single name as a result of hierarchical clustering, which avoids creating
isolated clusters that would contain only a few documents.
Besides the document classification scheme, terms in the school-education
domain have been collected and refined by school teachers and translation staff
members participating in the project. The domain vocabulary used in this study
contains 2679 terms, including terms related to educational activities (361
terms), the school calendar (918 terms), and children's healthcare (542 terms),
among others. With reference to the domain vocabulary as well as a generic
Japanese dictionary, indexing terms are extracted from the Japanese sentences
in the parallel translations of all the school documents by using a
morphological analyzer (footnote 6). All nouns are extracted from the documents
as indexing terms, but terms that consist of a single (hiragana or kanzi)
character are excluded as stop words. On
6 https://sen.dev.java.net/
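
The noun-extraction and stop-word rule described above can be sketched as
follows. This is a minimal illustration, not the project's implementation: it
assumes the morphological analyzer's output has already been converted into
(surface, part-of-speech) pairs, and the function names, the "noun" tag, and
the sample tokens are all hypothetical.

```python
import unicodedata


def is_hiragana_or_kanji(ch):
    """Return True if ch is exactly one hiragana or kanji character."""
    if len(ch) != 1:
        return False
    # Unicode character names identify the script of the character.
    name = unicodedata.name(ch, "")
    return name.startswith("HIRAGANA") or name.startswith("CJK UNIFIED IDEOGRAPH")


def extract_index_terms(tokens):
    """Keep nouns, dropping single-character hiragana/kanji stop words.

    tokens: iterable of (surface, pos) pairs. The "noun" tag is an assumed
    convention; a real analyzer will use its own part-of-speech labels.
    """
    terms = []
    for surface, pos in tokens:
        if pos != "noun":
            continue  # only nouns become indexing terms
        if is_hiragana_or_kanji(surface):
            continue  # single hiragana or kanji character: stop word
        terms.append(surface)
    return terms


# Made-up analyzer output for illustration only.
sample_tokens = [
    ("健康", "noun"),      # kenkou, two kanji: kept
    ("の", "particle"),    # not a noun: dropped
    ("日", "noun"),        # single kanji: dropped as a stop word
    ("診断", "noun"),      # two kanji: kept
    ("め", "noun"),        # single hiragana: dropped as a stop word
]
index_terms = extract_index_terms(sample_tokens)
```

With these sample tokens, only the two multi-character nouns remain as
indexing terms. Single katakana characters are not covered by the rule as
stated, so this sketch leaves them in.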