set of relevant documents that define a topic or category. For a given topic, text
categorization therefore consists in solving a two-class discrimination problem,
which can be solved by neural networks, support vector machines (Chap. 6)
or hidden Markov models (Chap. 4).
Text categorization is a very difficult problem, which goes well beyond
text search by keywords, because a text may be relevant to a topic even though
it contains none of the keywords that define the topic, or, conversely, a text
may be irrelevant although it contains some or even all keywords.
The present application (from [Stricker 2000]) was developed by the French
bank Caisse des Dépôts et Consignations, which provides an Intranet service
for filtering press releases of Agence France Presse (AFP) in real time. The
objective of the application is twofold:
- to develop an application that lets users automatically create an
  information filter on any topic of interest, provided that they supply
  examples of texts that are relevant to that topic;
- to develop a machine-learning-based tool that monitors the obsolescence
  of classical, rule-based information filters.
In the latter development, a neural filter is designed for the same topic
as the rule-based filter. Since the neural network does not generate a binary
response but estimates a relevance probability, the largest discrepancies
between the two filters can be analyzed and possibly traced to vocabulary
obsolescence. Two kinds of discrepancy are of interest: documents that the
rule-based filter rates as relevant but whose estimated relevance probability
is very low, and documents that the rule-based filter rates as irrelevant but
whose estimated relevance probability is close to one [Wolinski 2000].
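The comparison of the two filters can be illustrated by a minimal sketch. It assumes that each document carries the binary decision of the rule-based filter and the relevance probability estimated by the neural filter. The names Document and largest_discrepancies, and the sample values, are hypothetical and are not taken from the application described here.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    rule_relevant: bool   # binary decision of the rule-based filter
    neural_prob: float    # relevance probability estimated by the neural filter

def largest_discrepancies(docs, top_k=3):
    """Return the top_k documents on which the two filters disagree most."""
    def gap(doc):
        # Treat the rule-based decision as a probability of 0 or 1 (assumption).
        rule_score = 1.0 if doc.rule_relevant else 0.0
        return abs(rule_score - doc.neural_prob)
    return sorted(docs, key=gap, reverse=True)[:top_k]

# Illustrative values only.
docs = [
    Document("a", True, 0.95),   # both filters agree: relevant
    Document("b", True, 0.03),   # rule-based says relevant, network disagrees
    Document("c", False, 0.98),  # rule-based says irrelevant, network disagrees
    Document("d", False, 0.10),  # both filters agree: irrelevant
]
flagged = largest_discrepancies(docs, top_k=2)
```

In such a scheme, the flagged documents would be the ones to inspect by hand for signs of vocabulary obsolescence in the rule-based filter.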
The former development consists in designing and implementing an automatic
filter production system whose major feature is that, unlike rule-based
filters, it requires no assistance from an expert. Therefore, a two-class
discrimination system must be designed from a database of texts that are
labeled as relevant or irrelevant, which requires
- finding a representation of texts by real numbers, which should be as
  compact as possible,
- designing and implementing a classifier that uses that representation.
Thus, the problem of text representation, hence of input selection, is crucial
for that application.
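The two steps, representation followed by classification, can be sketched as follows. This is only an illustration under stated assumptions: each text is encoded by word-presence indicators over a vocabulary built from the labeled examples, and the classifier is a single logistic neuron trained by stochastic gradient ascent on the log-likelihood. Neither choice is claimed to be the one made in [Stricker 2000], and all function names and sample texts are hypothetical.

```python
import math

def build_vocabulary(texts):
    """Map each distinct word of the labeled texts to a vector index."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def to_vector(text, vocab):
    """Represent a text by real numbers: 1.0 if a word is present, else 0.0."""
    vec = [0.0] * len(vocab)
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] = 1.0
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(vectors, labels, epochs=200, lr=0.5):
    """Fit a single logistic neuron by stochastic gradient ascent."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = y - p  # gradient of the log-likelihood w.r.t. the potential
            bias += lr * err
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias

def relevance(text, vocab, weights, bias):
    """Estimated probability that the text is relevant to the topic."""
    x = to_vector(text, vocab)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Tiny labeled database (illustrative): 1 = relevant, 0 = irrelevant.
texts = [
    "bank raises interest rates",
    "central bank cuts rates",
    "football team wins match",
    "tennis player wins final",
]
labels = [1, 1, 0, 0]

vocab = build_vocabulary(texts)
weights, bias = train([to_vector(t, vocab) for t in texts], labels)
```

With one input per vocabulary word, the number of inputs grows with the size of the vocabulary, which is why a compact representation matters.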
1.4.5.1 Input Selection
The most popular approach to text representation is the bag-of-words
representation, whereby a text is represented by a vector, each component of
which is a number that is a function of the presence or absence of a given word in the