Qualitative Aspects (QAs)
with state-of-the-art learning methods to optimize
the matching of a present quest to one or more of
the existing quests.
Since qualitative aspects are, in principle, content-
independent, content-independent features should
be used to represent them. Luckily, a document is
more than a bag of words. Some natural language
features are promising candidates.
Using language features to differentiate documents
on dimensions other than topic has long been the
focus of computational stylistic studies. Stylistics
research assumes that the style of a text is
reflected in the choice of words and in the
arrangement and punctuation of the words actually
used in the document. In turn, identifying these
“style markers” (words and patterns) supports
categorizing or identifying documents with
particular styles. Various types of computable
linguistic features have been proposed as such
markers.
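In an early review of the analysis of literary
style, Holmes (1985) lists a number of possible
features that can be used in the analysis of
authorship, including word length, syllables per
word, sentence length, the distribution of parts of
speech, and function words.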
How to Represent Quests
Bag of Words
Even though many researchers have pointed
out dimensions other than topic in users'
information needs, a general, task-independent
representation of text content is still the primary
representation in all information retrieval systems.
This representation is based on the occurrence
and/or frequency of words and phrases. The basic
assumption is that the presence or absence of
words in a text is an indication of its topic.
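As a minimal sketch of this representation, the following Python fragment builds a word-frequency bag and a presence/absence vector. The function names `bag_of_words` and `presence_vector`, the simple regular-expression tokenizer, and the sample sentence are illustrative assumptions, not part of any particular retrieval system.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each lowercased word to its frequency; word order is discarded."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def presence_vector(bag, vocabulary):
    """Record only presence (1) or absence (0) of each vocabulary word."""
    return [1 if word in bag else 0 for word in vocabulary]


# Hypothetical sample text, used only for illustration.
bag = bag_of_words("The trade accident delayed the trade talks.")
vector = presence_vector(bag, ["trade", "accident", "weather"])
```

Because word order is thrown away, two texts that use the same words in different arrangements receive the same representation, which is why this representation captures topic rather than style.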
Frames
We use the concept of frames as described by the
builders of the HITIQA system (Small et al.,
2004). A frame is an event or relation expressed
in a piece of text. Entities involved in the event
or relation make up the frame's attributes, such
as location, person, organization, date, etc. The
HITIQA system uses BBN's Identifinder to extract
attributes from text passages. The central verb or
noun phrase of the passage is put in the TOPIC
attribute of the frame, which indicates the event
or relation, such as accident, trade, etc. Some
extensions of the basic frame concept generate
specialized typed frames. For example, a transfer
frame must have three attributes: TO, FROM, and
OBJECT.
In other words, a frame is a partially structured
representation of text. It gives a deeper
understanding of a text than can be expressed by
a bag of words, by making use of the semantic
functions of words.
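The frame idea can be sketched as a small data structure. This is not the HITIQA implementation: the class `Frame`, the table `REQUIRED_ATTRIBUTES`, the method `missing_attributes`, and the sample values are all hypothetical. Only the TOPIC attribute and the TO, FROM, and OBJECT requirement for transfer frames come from the description above.

```python
from dataclasses import dataclass, field

# Attributes each typed frame must carry. Only the "transfer" entry comes
# from the text; the lookup table itself is a hypothetical device.
REQUIRED_ATTRIBUTES = {"transfer": ("TO", "FROM", "OBJECT")}


@dataclass
class Frame:
    topic: str                                      # TOPIC: the event or relation
    attributes: dict = field(default_factory=dict)  # e.g. location, person, date
    frame_type: str = "generic"

    def missing_attributes(self):
        """List the required attributes this frame does not yet have."""
        required = REQUIRED_ATTRIBUTES.get(self.frame_type, ())
        return [name for name in required if name not in self.attributes]


# Hypothetical transfer frame with invented attribute values.
frame = Frame(
    topic="trade",
    attributes={"TO": "Country B", "FROM": "Country A",
                "OBJECT": "grain", "DATE": "2003"},
    frame_type="transfer",
)
incomplete = Frame(topic="trade", attributes={"TO": "Country B"},
                   frame_type="transfer")
```

Unlike a bag of words, the frame keeps the role each entity plays, so "from A to B" and "from B to A" yield different representations.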
Holmes's (1985) features, in more detail:
Word length: frequency, distributions
Syllables: average syllables per word,
distribution of syllables per word
Sentence length
Distribution of parts of speech
Function words
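Several of these features are straightforward to compute. The sketch below covers word length, sentence length, and function words only; syllable counts and parts of speech would need a pronunciation resource or a tagger and are left out. The function name `style_markers`, the tiny `FUNCTION_WORDS` set, and the punctuation-based sentence splitter are simplifying assumptions rather than the procedure of any cited study.

```python
import re
from collections import Counter

# Illustrative subset only; real studies use much longer function-word lists.
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "it", "for"}


def style_markers(text):
    """Compute a few content-independent features of a text."""
    # Naive sentence split on terminal punctuation (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not words or not sentences:
        raise ValueError("text contains no words")
    return {
        "word_length_distribution": dict(Counter(len(w) for w in words)),
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "mean_sentence_length": len(words) / len(sentences),
        "function_word_rate":
            sum(w in FUNCTION_WORDS for w in words) / len(words),
    }


markers = style_markers("The cat sat. It is a cat.")
```

None of these values depends on what the text is about, which is what makes them candidates for representing qualitative aspects.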
According to Rudman, over 1,000 linguistic
features have been proposed (Rudman, 1997).
Tweedie et al. also list a variety of linguistic
features that can be used as style markers (Tweedie
et al., 1998). Chaski's work includes word length,
vocabulary richness, frequency of function words,
punctuation marks, etc. as common “style markers”
(Chaski, 2001). Argamon has applied such features
to identify the sex of authors, and has compiled a
very extensive classification of stylistic features
(Argamon, 2003).
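Two of the markers mentioned here, vocabulary richness and punctuation, can be sketched as follows. The type-token ratio is used as one simple, well-known richness measure; it is an assumption that it stands in for the measures used in the cited work, and the function names `type_token_ratio` and `punctuation_rates` are hypothetical.

```python
import re
import string
from collections import Counter


def type_token_ratio(text):
    """Distinct words divided by total words; higher means richer vocabulary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def punctuation_rates(text):
    """Occurrences of each ASCII punctuation mark per character of text."""
    counts = Counter(ch for ch in text if ch in string.punctuation)
    total = len(text)
    return {mark: n / total for mark, n in counts.items()} if total else {}


richness = type_token_ratio("the cat and the dog")
rates = punctuation_rates("Hi, you!")
```

The type-token ratio falls as texts get longer, so comparisons are only meaningful between samples of similar length.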
The variety of features described above
indicates that there is some success but there is