vector is the weight of term i in document j. If
there are D documents in the collection, a T x
D term-by-document matrix will represent the
whole collection. The columns of this matrix are
vectors representing the documents, in terms of
their composition by terms. Every document's
position in the term-document space represents
the content of the document.
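As a rough illustration of the term-by-document matrix described above, the following Python sketch builds one from a toy collection. The whitespace tokenizer, the use of raw term counts as weights, and the sample documents are all assumptions made for the example; the text does not specify a weighting scheme.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix); matrix[i][j] is the weight of term i in document j."""
    tokenized = [doc.lower().split() for doc in docs]
    # The T distinct terms of the collection, in a fixed (sorted) order.
    terms = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    # Assumption: the weight is the raw term frequency.
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

# Hypothetical three-document collection (D = 3).
docs = ["cats chase mice", "dogs chase cats", "mice eat cheese"]
terms, matrix = term_document_matrix(docs)

# Each column of the T x D matrix is the vector for one document.
first_document_vector = [row[0] for row in matrix]
```

Here the rows correspond to terms and the columns to documents, so reading down a column gives a document's composition by terms.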
In information retrieval systems, geometric
relationships between document vectors are used
to measure how similar documents are in
content. The most commonly used measure of
similarity is the cosine of the angle between the
two document vectors.
When a user inputs a query, the query is treated
just like another document and is represented as
a vector too. So the similarity between the query
and any document can be calculated the same
way as between two documents. Documents
will be ranked according to their similarity to
the query.
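The query-as-document idea above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: raw term counts as weights, a whitespace tokenizer, and hypothetical helper names (to_vector, cosine, rank) that do not come from the text.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Represent a text as term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    # Terms outside the vocabulary are ignored.
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (similarity, document index) pairs, most similar first."""
    vocabulary = sorted({t for d in docs for t in d.lower().split()})
    q = to_vector(query, vocabulary)  # the query is treated like a document
    scored = [(cosine(q, to_vector(d, vocabulary)), i) for i, d in enumerate(docs)]
    # Sort by descending similarity; ties are broken by document index.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical collection and query.
docs = ["cats chase mice", "dogs chase cats", "mice eat cheese"]
ranking = rank("cheese mice", docs)
```

Because the query and the documents share one vector space, the same cosine function serves for both document-to-document and query-to-document comparison.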
We propose to represent QAs and CAs by
several groups of linguistic features. In contrast
to the bag-of-words representation of
document topic, an individual linguistic feature
may not represent a dimension in the space of
QAs or CAs. How the similarity of two documents
on a QA or CA is calculated will depend on the
models learned.
For a linear model, each QA must be reduced
to a number. The angle between two documents'
QA scores can be used as a similarity measure.
For rule-based models, each document will
be classified into a definite group of QA or CA.
The similarity assessment problem then turns into
a binary classification problem.
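One possible reading of the two cases above is sketched below in Python. The function names, the score values, and the group labels are hypothetical, and treating "same group" as the binary outcome for rule-based models is an interpretation made for this example, not something the text states.

```python
import math

def score_similarity(scores_a, scores_b):
    """Linear-model case: cosine of the angle between two vectors of QA scores."""
    dot = sum(a * b for a, b in zip(scores_a, scores_b))
    norm_a = math.sqrt(sum(a * a for a in scores_a))
    norm_b = math.sqrt(sum(b * b for b in scores_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def group_similarity(group_a, group_b):
    """Rule-based case (assumed reading): 1 if both documents fall in the same group, else 0."""
    return 1 if group_a == group_b else 0

# Hypothetical QA scores for two documents, one number per QA.
doc_a_scores = [0.8, 0.2, 0.5]
doc_b_scores = [0.4, 0.1, 0.25]
linear_sim = score_similarity(doc_a_scores, doc_b_scores)

# Hypothetical group labels assigned by a rule-based model.
same_group = group_similarity("high", "high")
different_group = group_similarity("high", "low")
```

In the linear case the similarity is graded, while in the rule-based case it collapses to a yes-or-no answer.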
Machine Learning Issues
Machine learning is a key component of quest
presentation and similarity measurement. The
first machine learning issue is classification of
quests based on their QAs or CAs. A growing
number of statistical classification and machine
learning techniques have been applied to text
categorization. Most of these applications are based
on document content. Our work has proved that
multiple learning methods can be used to classify
documents based on their QA or CA features.
Classification is the process of trying to predict
the category of unknown data, given existing
classified data. Typically the data set is divided
into a training data set and a test data set. Elements
of the training data set are described by a set of
independent features and a target variable whose
value is available. A machine learning algorithm
is applied to the training data set iteratively to
identify patterns of features in it. This is usually
repeated many times until the error falls below
some threshold. The patterns might include many
features in the set or only a few of them. The
resulting pattern must be represented in some type
of model, such as a decision tree. Once a pattern
is chosen, the test data (unknown to the algorithm)
are run through the model, and the error rate is
recorded. Again, this is usually repeated several
times with different test sets, to get an average
error rate. With a collection of texts hand-tagged
for QAs and CAs, a supervised learning method will
build classifiers, and the resulting models will
then be evaluated on new test cases.
The second machine learning issue is that
the system must learn how to represent and match
users' quests to stored quests. This is similar to
the adaptive filtering problem, where incoming
messages are represented, typically, by very large
vectors, and the underlying assumption is that
some connected and “well shaped” subset of these
vectors comprises the ones that should be transmitted
for further human review. (This problem is dual
to the increasingly common spam problem.)
However, there is a cost associated with verifying
that a specific message should indeed have
been transmitted. At any moment in time, the
sensor module has a “current belief” about which
messages should be transmitted. But strictly fol-
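The train-then-test procedure described for the first issue above can be sketched as follows. This is only an illustration under stated assumptions: the learner is a one-level decision tree (a "stump") found by exhaustive search rather than by iterative refinement, the data are synthetic stand-ins for hand-tagged feature vectors, and every name here is hypothetical.

```python
import random

def train_stump(rows):
    """Fit a one-level decision tree on rows of (features, label), label in {0, 1}.

    Returns (feature index, threshold, label predicted when the feature exceeds the threshold).
    """
    best = None
    for f in range(len(rows[0][0])):
        for thr in sorted({x[f] for x, _ in rows}):
            for above in (0, 1):
                errors = sum(
                    1 for x, y in rows
                    if (above if x[f] > thr else 1 - above) != y
                )
                # Keep the first configuration with the fewest training errors.
                if best is None or errors < best[0]:
                    best = (errors, f, thr, above)
    return best[1:]

def predict(stump, x):
    f, thr, above = stump
    return above if x[f] > thr else 1 - above

def error_rate(stump, rows):
    """Fraction of rows the stump misclassifies."""
    return sum(predict(stump, x) != y for x, y in rows) / len(rows)

def average_error(rows, n_rounds=5, test_fraction=0.3, seed=0):
    """Average test error over several random training/test splits."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_rounds):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * test_fraction)
        test, train = shuffled[:cut], shuffled[cut:]
        # The test rows are never shown to the learner.
        rates.append(error_rate(train_stump(train), test))
    return sum(rates) / len(rates)

# Synthetic data: feature 0 determines the label, feature 1 is noise.
rows = [((i / 20, (i * 7 % 20) / 20), int(i / 20 > 0.5)) for i in range(20)]
full_fit_error = error_rate(train_stump(rows), rows)
avg_error = average_error(rows)
```

Averaging over several splits, as in average_error, mirrors the practice of repeating the evaluation with different test sets; a fixed seed keeps the sketch reproducible.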