summaries. The average length of such an abstract is 132 words. In order to process numbers and symbols as well, it was decided to discard the words that are too rare (fewer than 50 occurrences) and to discard 1,355 words that are semantically poor. Eventually, a set of 43,222 words was retained for the whole corpus.
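As an illustration, the following minimal sketch performs that kind of vocabulary filtering; the corpus representation (a list of token lists), the stoplist of semantically poor words, and the threshold parameter are assumptions made for the example, not details of the Websom implementation.

```python
from collections import Counter

def build_vocabulary(corpus, stoplist, min_count=50):
    """Keep the words that occur at least min_count times in the corpus
    and that are not listed among the semantically poor words."""
    counts = Counter(word for document in corpus for word in document)
    return {word for word, count in counts.items()
            if count >= min_count and word not in stoplist}
```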
Several versions of the system exist. The earliest one coded the text histogram directly as a vector whose length was the number of words in the corpus vocabulary. With that coding, each component of the vector representing a text is the weighted occurrence frequency of the associated word in the text, the weights being set according to the influence of the word on the overall meaning of the document. That dimension was too large to allow further processing. Several data compression methods were proposed to cope with this high-dimension problem: projection reduction (principal component analysis) or random projection. Eventually, a random projection method was implemented: each text is represented by a 500-dimensional vector.
Such a vector summarizes the text, as derived from a statistical analysis of its vocabulary. The coding complexity is O(NL) + O(n), where N is the number of documents, L is the average number of distinct words in a document, and n is the initial histogram dimension. To appreciate the reduction, it is worth pointing out that the simpler projection compression method has a complexity on the order of NLd, where d is the reduced dimension (here 500). The reduction is thus substantial, and it makes it possible to extend Websom to the whole corpus.
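The following sketch illustrates how such a coding can reach the stated complexity: one random d-dimensional vector is drawn once for each of the n vocabulary words (the O(n) term), and each document is then encoded by summing the vectors of its distinct words, weighted by their occurrence frequencies (the O(NL) term overall). The use of raw frequencies as weights and of Gaussian random directions are simplifying assumptions; Websom's actual weights reflect each word's influence on the document's meaning.

```python
import numpy as np
from collections import Counter

def make_projection(vocabulary, d=500, seed=0):
    """Draw one fixed random d-dimensional direction per vocabulary word:
    the O(n) setup cost of the coding."""
    rng = np.random.default_rng(seed)
    return {word: rng.standard_normal(d) for word in vocabulary}

def encode(document, projection, d=500):
    """Encode one document in O(L): sum the random directions of its
    distinct words, weighted by occurrence frequency, then normalize."""
    vector = np.zeros(d)
    for word, frequency in Counter(document).items():
        if word in projection:
            vector += frequency * projection[word]
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector
```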
7.5.3.2 Specific Features of the Learning Process
The two-dimensional map provides a visual representation of the organization of the corpus, which is a great help for document retrieval. At the end of the learning phase, the allocation phase, which associates a neuron with each document, makes it possible to locate a given document with respect to the global corpus: texts with similar meanings are expected to lie in nearby zones of the map. In the last version of Websom, the corpus is divided into 21 sections (agriculture, transportation, chemistry, electricity, etc.). To extract that information, each neuron is endowed with one of the section labels and with a set of keywords. The keywords are extracted from the subset of texts that are allocated to the neuron. More precisely, the section label is determined by a majority vote over that text subset, and the keywords are selected by taking the intersection of the keyword sets of all the texts in the subset.
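A minimal sketch of that labeling step follows; it assumes each allocated text already carries a section label and a keyword set, which the rest of the pipeline would have produced upstream.

```python
from collections import Counter

def label_neuron(texts):
    """texts: (section_label, keyword_set) pairs of the texts allocated
    to one neuron. Returns the majority section label and the keywords
    shared by every text of the subset."""
    if not texts:
        return None, set()
    majority = Counter(section for section, _ in texts).most_common(1)[0][0]
    shared_keywords = set.intersection(*(keywords for _, keywords in texts))
    return majority, shared_keywords
```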
When Websom is used, texts with close meanings are projected onto close regions of the two-dimensional map. Thus, projecting a text onto the map locates its meaning with respect to the whole set of texts of the training base, which is actually the whole corpus. The map labeling makes it possible to interpret a new text through an automatic process; the neighboring neurons provide additional information that allows a finer understanding.
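For illustration, locating a new text amounts to finding the neuron whose codebook vector is closest to the text's 500-dimensional code; the label and keywords of that neuron, and of its neighbors on the map, then provide the automatic interpretation. The codebook array below is a hypothetical stand-in for the trained map.

```python
import numpy as np

def best_matching_unit(text_vector, codebook):
    """codebook: array of shape (n_neurons, d) holding one weight vector
    per neuron; returns the index of the closest neuron."""
    distances = np.linalg.norm(codebook - text_vector, axis=1)
    return int(np.argmin(distances))
```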
Considering the very large number of documents in the base, a large number of neurons is required in order to perform a fine enough