Neural Networks: An Overview - Neural Networks: Methodology and Applications


set of relevant documents that define a topic or category. For a given topic, text

categorization therefore consists in solving a two-class discrimination problem,

which can be solved by neural networks, support vector machines (Chap. 6)

or hidden Markov models (Chap. 4).

Text categorization is a very difficult problem, which goes far beyond

text search by keywords, because a text may be relevant to a topic even though

it contains none of the keywords that define the topic, or, conversely, a text

may be irrelevant although it contains some or even all keywords.

The present application (from [Stricker 2000]) was developed by the French

bank Caisse des Dépôts et Consignations, which provides an Intranet service

for filtering press releases of Agence France Presse (AFP) in real time. The

objective of the application is twofold:

• to develop an application that lets the user automatically create an

information filter on any topic of interest to him, provided that

he supplies examples of texts that are relevant to his topic of interest;

• to develop a machine-learning based tool that monitors the obsolescence

of classical, rule-based information filters.

In the latter development, a neural filter is designed on the same topic

as the rule-based filter. Since the neural network does not generate a binary

response, but estimates a relevance probability, the largest discrepancies

between the two filters can be analyzed and possibly traced to vocabulary

obsolescence. Two kinds of documents are of interest: those that are rated as

relevant by the rule-based filter but whose relevance probability, as

estimated by the neural network, is very low, and those that are rated as

irrelevant by the rule-based filter but whose estimated relevance

probability is close to one [Wolinski 2000].
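
As an illustration of the comparison described above, the following minimal sketch ranks documents by the gap between the binary decision of the rule-based filter and the relevance probability estimated by the neural filter. The function name `largest_discrepancies`, the `top_k` parameter and the sample values are hypothetical and are not taken from [Wolinski 2000]; the analysis carried out there may well differ.

```python
from typing import List, Tuple


def largest_discrepancies(
    rule_labels: List[bool],
    neural_probs: List[float],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, discrepancy) pairs, largest discrepancy first.

    The discrepancy is |rule label - neural probability|, where the rule
    label counts as 1.0 for "relevant" and 0.0 for "irrelevant".
    """
    if len(rule_labels) != len(neural_probs):
        raise ValueError("one rule label and one probability per document")
    scored = [
        (i, abs((1.0 if label else 0.0) - p))
        for i, (label, p) in enumerate(zip(rule_labels, neural_probs))
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up values for five documents (illustrative only).
rule_labels = [True, True, False, False, True]
neural_probs = [0.95, 0.04, 0.97, 0.10, 0.60]
flagged = largest_discrepancies(rule_labels, neural_probs, top_k=2)
flagged_indices = [index for index, _ in flagged]
```

With these made-up values, the two flagged documents are one that the rules accept but the network rates near zero, and one that the rules reject but the network rates near one. These are the two cases that a human reader would then inspect for outdated vocabulary.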

The former development consists in designing and implementing an automatic

filter production system, whose major feature is that it does not require

any assistance from an expert, as opposed to rule-based filters. Therefore,

a two-class discrimination system must be designed from a database of

texts that are labeled as relevant or irrelevant, which requires

• finding a representation of texts by real numbers, which should be as

compact as possible,

• designing and implementing a classifier that uses that representation.

Thus, the problem of text representation, hence of input selection, is crucial

for that application.
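
The two requirements above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the system of [Stricker 2000]: the four-word vocabulary, the training texts, the helper names `encode`, `predict` and `train`, and the choice of a single sigmoid neuron trained by stochastic gradient descent are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def encode(text: str, vocabulary: Sequence[str]) -> List[float]:
    """Represent a text by one real number per vocabulary word (1.0 if present)."""
    words = set(text.lower().split())
    return [1.0 if word in words else 0.0 for word in vocabulary]


def predict(weights: List[float], bias: float, x: List[float]) -> float:
    """Sigmoid output of a single neuron, read as a relevance probability."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(
    inputs: List[List[float]],
    labels: List[int],
    epochs: int = 200,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit the neuron by stochastic gradient descent on the cross-entropy loss."""
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, labels):
            error = predict(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Hypothetical vocabulary and labeled texts (1 = relevant, 0 = irrelevant).
vocabulary = ["bank", "merger", "football", "match"]
texts = [
    "bank announces merger",
    "merger talks at the bank",
    "football match tonight",
    "the match was a football classic",
]
labels = [1, 1, 0, 0]

inputs = [encode(t, vocabulary) for t in texts]
weights, bias = train(inputs, labels)
p_relevant = predict(weights, bias, encode("bank merger approved", vocabulary))
p_irrelevant = predict(weights, bias, encode("football match postponed", vocabulary))
```

The classifier sees only the vector produced by `encode`, so any information lost at that stage is unavailable to it. This is why the choice of representation matters as much as the choice of classifier.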

1.4.5.1 Input Selection

The most popular approach to text representation is the bag-of-words

representation, whereby a text is represented by a vector, each component of which

is a number that is a function of the presence or absence of the word in the
