Adaptive Information Filtering - Text Mining: Classification, Clustering, and Applications

8.3.1.1 Boolean models

The Boolean model is the simplest retrieval model based on Boolean algebra

and set theory. The concept is very simple and intuitive. The Boolean model

has two drawbacks: 1) users may have difficulty expressing their information

needs as Boolean expressions; and 2) the retrieval system can hardly rank

documents, since a document is predicted to be either relevant or

non-relevant without any notion of degree of relevance.

Nevertheless, the Boolean model is widely used in commercial search engines

because of its simplicity and efficiency. It is not straightforward to use

relevance feedback from the user to refine a Boolean query, so the Boolean

model was extended for this purpose (34).
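
The set-theoretic view above can be illustrated with a minimal sketch. The toy documents, the `build_index` helper, and the query are all hypothetical and not taken from the text; the point is only that a Boolean query maps onto set intersection, union, and difference, and that the result is an unranked set.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical toy collection, for illustration only.
docs = {
    1: "adaptive filtering of text streams",
    2: "boolean retrieval with set theory",
    3: "text retrieval and filtering",
}
index = build_index(docs)
all_ids = set(docs)

# Query: text AND (retrieval OR filtering) AND NOT boolean
result = (
    index["text"]
    & (index["retrieval"] | index["filtering"])
    & (all_ids - index["boolean"])
)
print(sorted(result))
```

Each document is either in `result` or not, which is why this model offers no basis for ranking one matching document above another.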

8.3.1.2 Vector space models

The vector model is a widely implemented IR model, most famously built

in the SMART system (52). It represents documents and user queries in a

high dimensional space indexed by “indexing terms,” and assumes that the

relevance of a document can be measured by the similarity between it and

the query in the high dimensional space (51). In the vector space framework,

relevance feedback is used to reformulate a query vector so that it is closer to

the relevant documents, or for query expansion so that additional terms from

the relevant documents are added to the original query. The most famous

algorithm is the Rocchio algorithm (50), which represents a user query as

a linear combination of the original query vector, the centroid of the

relevant documents, and the centroid of the non-relevant documents.
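
As a rough sketch of the linear combination described above, the following code reformulates a query vector and checks that it moves closer to a relevant document under cosine similarity. The weights `alpha`, `beta`, and `gamma`, their default values, and the three-term toy vectors are assumptions chosen for illustration; the text does not specify them.

```python
import math

def centroid(vectors, dim):
    """Component-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*query + beta*centroid(relevant) - gamma*centroid(non_relevant)."""
    dim = len(query)
    rel_c = centroid(relevant, dim)
    non_c = centroid(non_relevant, dim)
    return [alpha * query[i] + beta * rel_c[i] - gamma * non_c[i] for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical term weights over a three-term vocabulary.
query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
non_relevant = [[0.0, 0.0, 1.0]]

new_query = rocchio(query, relevant, non_relevant)
print(cosine(query, relevant[0]), cosine(new_query, relevant[0]))
```

The second term had zero weight in the original query and a positive weight afterwards, which is the query-expansion effect; the third term receives a negative weight, which some implementations clip to zero.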

A major criticism of the vector space model is that its performance depends

highly on the representation, while the choice of representation is heuristic

because the vector space model itself does not provide a theoretical framework

for selecting key terms or setting term weights.

8.3.1.3 Probabilistic models

Probabilistic models, such as the Binary Independence Model (BIM) (44),

provide direct guidance on term weighting and term selection based on

probability theory. In these probabilistic models, the probability that a

document d is relevant to a user query q is modelled explicitly (43) (44) (23).

Using relevance feedback to improve parameter estimation in probabilistic

models is straightforward according to the definition of the models, because

they presuppose relevance information.
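
To make the link between relevance feedback and term weighting concrete, here is a minimal sketch of one estimate commonly associated with the BIM: the Robertson/Sparck Jones relevance weight with 0.5 smoothing. The text does not give this formula, so the function `rsj_weight`, its argument names, and the example counts are assumptions for illustration.

```python
import math

def rsj_weight(r, n, R, N):
    """Smoothed log-odds weight of a term under relevance feedback.

    r: judged-relevant documents that contain the term
    n: documents in the collection that contain the term
    R: judged-relevant documents in total
    N: documents in the collection
    """
    odds_in_relevant = (r + 0.5) / (R - r + 0.5)
    odds_in_non_relevant = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(odds_in_relevant / odds_in_non_relevant)

# Hypothetical counts: a term found in 20 of 1000 documents, 10 judged relevant.
strong = rsj_weight(r=8, n=20, R=10, N=1000)   # term occurs in most relevant documents
weak = rsj_weight(r=1, n=20, R=10, N=1000)     # term occurs in few relevant documents
no_feedback = rsj_weight(r=0, n=20, R=0, N=1000)
print(strong, weak, no_feedback)
```

With no judged documents (`R = 0`), the weight depends only on how rare the term is in the collection; as relevance judgments arrive, the counts `r` and `R` update the estimate directly.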

In recent decades many researchers have proposed IR models that are more

general, while also explaining already existing IR models. For example,

inference networks have been successfully implemented in the well-known

INQUERY retrieval system (57). Bayesian networks extend the view of inference

networks. Both models represent documents and queries using acyclic graphs.

Unfortunately, neither model provides a sound theoretical framework to
