Collaborative Retrieval Systems - Collaborative Technologies and Applications for Interactive Information Design

vector is the weight of term i in document j. If there are D documents in the collection, a T × D term-by-document matrix will represent the whole collection. The columns of this matrix are vectors representing the documents in terms of the terms they contain. Every document's position in the term-document space represents the content of the document.
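
The term-by-document matrix described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization and uses raw term counts as the weights, and the function name `term_document_matrix` and the sample documents are hypothetical.

```python
from collections import Counter


def term_document_matrix(documents):
    """Build a T x D matrix of term weights (rows: terms, columns: documents).

    Assumption: the weight of a term in a document is its raw count.
    Other weighting schemes would change only the cell values.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # matrix[i][j] is the weight of term i in document j.
    matrix = [[counts[j][term] for j in range(len(documents))]
              for term in vocabulary]
    return vocabulary, matrix


# Hypothetical three-document collection.
docs = ["retrieval of documents", "documents about documents", "query retrieval"]
vocabulary, matrix = term_document_matrix(docs)

# Column j of the matrix is the vector for document j.
document_vectors = [list(column) for column in zip(*matrix)]
```

Reading the matrix by columns gives one vector per document, which is the form the similarity calculations operate on.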

In information retrieval systems, geometric relationships between document vectors are used to calculate how similar the documents are in content. The most commonly used similarity measure is the cosine of the angle between two document vectors.
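
As a small sketch of the cosine measure, the function below computes the dot product of two vectors divided by the product of their lengths. The name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an empty (all-zero) vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)
```

Vectors that point in the same direction score close to 1 regardless of their length, and vectors that share no terms score 0.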

When a user inputs a query, the query is treated just like another document and is also represented as a vector. The similarity between the query and any document can therefore be calculated in the same way as between two documents. Documents are then ranked according to their similarity to the query.
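
The query-as-document idea can be illustrated with a short ranking sketch. It assumes whitespace tokenization and raw term counts, and the helper names (`vectorize`, `cosine`, `rank_documents`) are hypothetical. Query terms that occur in no document are ignored, which does not change the ranking.

```python
import math
from collections import Counter


def vectorize(text, vocabulary):
    """Represent a text as raw term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def rank_documents(query, documents):
    """Return document indices ordered from most to least similar to the query."""
    vocabulary = sorted({t for doc in documents for t in doc.lower().split()})
    query_vector = vectorize(query, vocabulary)
    scored = [(cosine(query_vector, vectorize(doc, vocabulary)), index)
              for index, doc in enumerate(documents)]
    # Highest similarity first; ties keep the original document order.
    return [index for _, index in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

The query goes through exactly the same `vectorize` step as the documents, so no separate query model is needed in this sketch.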

We propose to represent QAs and CAs by several groups of linguistic features. In contrast to the bag-of-words representation of document topic, an individual linguistic feature may not correspond to a dimension in the space of QAs or CAs. How the similarity of two documents on a QA or CA is calculated will depend on the models learned.

For a linear model, each QA must be reduced to a number. The angle between two documents' QA scores can be used as a similarity measure.

For rule-based models, each document will be classified into a definite group of QA or CA. The similarity assessment problem then turns into a binary classification problem.
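
The contrast between the two kinds of model can be sketched roughly as below. Everything here is an assumption made for illustration: the text does not give the learned weights, the rules, or the group names. The sketch reads "QA scores" as a vector with one number per QA, and it uses a single made-up threshold rule to stand in for a learned rule set.

```python
import math


def qa_scores(features, weights_per_qa):
    """Linear-model sketch: reduce each QA to one number (a weighted feature sum).

    weights_per_qa holds one hypothetical weight vector per QA.
    """
    return [sum(w * x for w, x in zip(weights, features))
            for weights in weights_per_qa]


def angle_similarity(scores_a, scores_b):
    """Cosine of the angle between two documents' QA score vectors."""
    dot = sum(a * b for a, b in zip(scores_a, scores_b))
    norms = (math.sqrt(sum(a * a for a in scores_a))
             * math.sqrt(sum(b * b for b in scores_b)))
    return dot / norms if norms else 0.0


def rule_based_group(features, threshold=0.5):
    """Rule-based sketch: one hypothetical rule assigns a definite group."""
    return "high" if features[0] >= threshold else "low"


def same_group(features_a, features_b):
    """With definite groups, similarity reduces to a yes/no decision."""
    return rule_based_group(features_a) == rule_based_group(features_b)
```

In the linear case similarity is graded, while in the rule-based case two documents either fall into the same group or they do not.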

A growing number of statistical classification and machine learning techniques have been applied to text categorization. Most of these applications are based on document content. Our work has shown that multiple learning methods can be used to classify documents based on their QA or CA features.

Classification is the process of predicting the category of unseen data, given existing classified data. Typically the data set is divided into a training data set and a test data set. Elements of the training data set are described by a set of independent features and a target variable whose value is known. A machine learning algorithm is applied to the training data set iteratively to identify patterns of features. This is usually repeated many times until the error is reduced below some threshold. The patterns might include many features in the set or only a few of them. The resulting pattern must be represented in some type of model, such as a decision tree. Once a pattern is chosen, the test data (unknown to the algorithm) are run through the pattern, and the error rate is recorded. Again, this is usually repeated several times with different test sets to obtain an average error rate. With a collection of texts hand-tagged for QAs and CAs, a supervised learning method will build classifiers, and the resulting models will then be evaluated on new test cases.
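
The train-then-test procedure described above might look like the following sketch. It is a deliberate simplification, not the method used in this work: the model is a one-split decision tree (a "stump") found by exhaustive search rather than by iterating until an error threshold is met, labels are assumed to be 0 or 1, and all function names are hypothetical.

```python
import random


def predict(model, features):
    """Apply a one-split decision tree (a stump) to one feature vector."""
    index, threshold, label_if_above = model
    return label_if_above if features[index] >= threshold else 1 - label_if_above


def error_rate(model, rows):
    """Fraction of (features, label) rows that the model gets wrong."""
    return sum(1 for x, y in rows if predict(model, x) != y) / len(rows)


def train_stump(rows):
    """Try every feature, midpoint threshold, and polarity; keep the best."""
    best_model, best_error = None, None
    for index in range(len(rows[0][0])):
        values = sorted({x[index] for x, _ in rows})
        for low, high in zip(values, values[1:]):
            for label_if_above in (0, 1):
                model = (index, (low + high) / 2, label_if_above)
                err = error_rate(model, rows)
                if best_error is None or err < best_error:
                    best_model, best_error = model, err
    if best_model is None:
        raise ValueError("no feature varies in the training data")
    return best_model


def average_test_error(rows, repeats=5, test_fraction=0.3, seed=0):
    """Repeat a random train/test split and average the test error rates."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        shuffled = list(rows)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_fraction))
        model = train_stump(shuffled[:cut])      # learn from training rows only
        errors.append(error_rate(model, shuffled[cut:]))  # score on held-out rows
    return sum(errors) / len(errors)
```

The test rows never reach `train_stump`, which is what makes the recorded error rate an estimate of performance on new cases. A fuller classifier could replace the stump without changing the evaluation loop.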

The second machine learning issue is that the system must learn how to represent users' quests and match them to stored quests. This is similar to the adaptive filtering problem, where incoming messages are typically represented by very large vectors, and the underlying assumption is that some connected and “well shaped” subset of these vectors contains the messages that should be transmitted for further human review. (This problem is dual to the increasingly common spam problem.) However, there is a cost associated with verifying that a specific message should indeed have been transmitted. At any moment in time, the sensor module has a “current belief” about which messages should be transmitted. But strictly fol-

Machine Learning Issues

Machine learning is a key component of quest presentation and similarity measurement. The first machine learning issue is classification of quests based on their QAs or CAs.
