no consensus on which features are best. The
separation of non-topical document qualitative
aspects from topicality is essentially similar to the
separation of “style” and content in the context of
computational stylistic studies. Stylistic studies
therefore form a reasonable starting point for
identifying possible indicators of qualitative aspects
among the sets of “style markers”. Our work has shown
that some of these “style markers” are promising
indicators of document qualitative aspects (Sun,
Dissertation, Fall 2005).
As mentioned above, we will use GATE to
obtain the Part Of Speech frequencies. We will
also use GATE's extensible Gazetteer function to
count the frequencies of lists of words (entities
and declarative words). GATE itself has some
default entity lists, such as person, location, date,
etc. We can also create our own expanded entity
lists. We will use WordNet to find all hypernyms
of all words in one GATE default list. We then
combine the two sets of words and remove
duplicates to form our expanded entity
list. WordNet is a lexical reference system that
organizes English nouns, verbs, adjectives, and
adverbs into synonym sets (Fellbaum 1998). It is
developed and maintained by the Cognitive Sci-
ence Laboratory at Princeton University.
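The list-expansion step can be sketched as follows. This is only a minimal illustration under stated assumptions: the `HYPERNYMS` table is a hypothetical stand-in for a WordNet lookup, the seed words are invented for the example, and nothing here calls the GATE or WordNet APIs.

```python
# Hypothetical table mapping a word to all of its hypernyms.
# In practice these would come from a WordNet lookup.
HYPERNYMS = {
    "city": ["municipality", "region"],
    "river": ["stream", "body of water"],
}


def expand_entity_list(seed_words, hypernyms):
    """Combine a seed list with the hypernyms of its words, dropping duplicates."""
    expanded, seen = [], set()
    for word in seed_words:
        # Consider the seed word itself, then each of its hypernyms.
        for candidate in [word] + hypernyms.get(word, []):
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


# Invented seed list standing in for a GATE default "location" list.
location_seed = ["city", "river", "region"]
expanded_locations = expand_entity_list(location_seed, HYPERNYMS)
```

Here "region" is both a seed word and a hypernym of "city", so it appears only once in the expanded list.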
We will also use WordNet to obtain various
declarative word lists. The general method, used
in our earlier work, starts from a list created by
a human expert, who examines pieces of texts
which are saved as evidence in support of user
judgments about qualitative or content aspects. We
call these “Revealed Indicators”, represented as a
list R. We then use WordNet to get all ancestors
of the words in the original list. Those words that
appear in both the original list and the ancestor
list form a list called R+. Those words that show
up only in the ancestor list, and not in the original
list, form a list called R-.
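Read as set operations, R+ is the intersection of the original list with the ancestor list, and R- is the ancestor list minus the original list. A minimal sketch, assuming a hypothetical `ANCESTORS` table in place of real WordNet hypernym chains and an invented list R:

```python
# Hypothetical ancestor table; real ancestors would come from WordNet.
ANCESTORS = {
    "claim": ["assert", "state"],
    "assert": ["state", "express"],
    "state": ["express"],
}


def split_revealed_indicators(revealed, ancestors):
    """Return (R+, R-) for a list of revealed indicators R."""
    original = set(revealed)
    ancestor_words = set()
    for word in revealed:
        ancestor_words.update(ancestors.get(word, []))
    r_plus = sorted(original & ancestor_words)   # in both lists
    r_minus = sorted(ancestor_words - original)  # only in the ancestor list
    return r_plus, r_minus


R = ["claim", "assert", "state"]  # invented example list
r_plus, r_minus = split_revealed_indicators(R, ANCESTORS)
```

With this toy data, "assert" and "state" fall into R+ because they are both original indicators and ancestors of other indicators, while "express" falls into R-.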
For other features, we use Perl scripts to
calculate the location and frequency of the features.
This work will build on an extensive library of
scripts developed in the HITIQA project over the
past three years (Ng et al., 2003; Bai et al., 2004;
Rittman et al., 2004).
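As an illustration of what "location and frequency" might mean for a single word-level feature, the sketch below records token offsets and a relative frequency. It is written in Python rather than Perl and is not taken from the HITIQA scripts; the tokenizer, the function name, and the sample sentence are all assumptions made for the example.

```python
import re


def feature_positions(text, feature):
    """Return (token offsets of `feature`, total token count), case-insensitively."""
    tokens = re.findall(r"[a-z']+", text.lower())
    target = feature.lower()
    positions = [i for i, tok in enumerate(tokens) if tok == target]
    return positions, len(tokens)


sample = "However, the results vary. The authors, however, disagree."
positions, n_tokens = feature_positions(sample, "however")
frequency = len(positions)
relative_frequency = frequency / n_tokens if n_tokens else 0.0
```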
Content Aspects (CAs)
Content aspects are different from document
topics. However, their relationship to document
content is not quite as “orthogonal” as that of
qualitative aspects. In the proposed research
we will build on techniques developed in the
HITIQA project.
We propose to identify indicators from four
sources. Naïve Word Lists. We are presently
creating lists of words for each of several content
aspects (military, scientific-technical, biographical,
etc.). The words in each list must represent the
corresponding content aspect and discriminate it
from other aspects. Named Identities. The assumption
is that the frequencies of some types of named
identities may vary considerably among different
aspects. Adjective Classes. In our previous work, we
have accumulated several classes of adjectives. The
categories of these classes are related to content
aspects. Style Markers. A particular content aspect
may be associated with a particular style. For
example, a text with a science-technology perspective
is more likely to be objective. We propose that some
style markers may be good indicators of CAs. In all
of this we will build on knowledge gained in the
work with the HITIQA system.
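One way the naïve word lists could be turned into features is to compute, for each content aspect, the share of a document's tokens that fall in that aspect's list. The sketch below is an assumption about how such a feature might be computed, not the method used in HITIQA; the word lists in `ASPECT_WORDS` and the sample sentence are invented.

```python
import re
from collections import Counter

# Hypothetical word lists; the actual lists are not given in the text.
ASPECT_WORDS = {
    "military": {"troops", "army", "missile"},
    "scientific-technical": {"experiment", "hypothesis", "data"},
}


def aspect_profile(text, aspect_words):
    """Map each aspect to the fraction of tokens found in its word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {
        aspect: sum(counts[word] for word in words) / total
        for aspect, words in aspect_words.items()
    }


doc = "The experiment produced data that support the hypothesis."
profile = aspect_profile(doc, ASPECT_WORDS)
```

A list that discriminates well, in the sense described above, would yield a high share for documents of its own aspect and a share near zero for the others.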
How to Assess Similarity
The vector space model is used when documents
are represented as a bag of words. In the vector
space model, each document in the collection is
represented as a vector with components labeled
by terms. Each term in the vector has its own
weight to reflect how important it is in describ-
ing the content of the document. If the collection
contains T index terms, each document will be
a T-dimensional vector. The element w_{i,j} in the