The hidden layer consists of two hidden units: “computer science” and “math”. These units (variables) are also binary (Boolean). The output layer has only one unit, named “document class”, which is binary: 0 means the document belongs to the computer science class and 1 means it belongs to the math class. The evaluation (activation) function used in the network is the sigmoid function. Our topology is a feed-forward neural network (shown in figure 4) in which the weights can be initialized arbitrarily.
Note that we denote Boolean values as 0 and 1 (instead of false and true) for convenience, because the neural network accepts only numeric values for its units.
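The forward pass of such a network can be sketched as follows. This is only an illustration under stated assumptions: the network is taken to be fully connected with no bias terms, and the weight values in `W_HIDDEN` and `W_OUTPUT` are hypothetical placeholders (the text says weights can be initialized arbitrarily). The names `sigmoid`, `forward`, `W_HIDDEN` and `W_OUTPUT` are not from the original.

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical, arbitrarily chosen weights (assumption, not the figure's values).
# Input units: C (computer), P (programming language), A (algorithm), D (derivative).
# Hidden units: S (computer science), M (math).
W_HIDDEN = {
    "S": {"C": 0.5, "P": 0.5, "A": 0.5, "D": 0.5},
    "M": {"C": 0.5, "P": 0.5, "A": 0.5, "D": 0.5},
}
# Weights from the hidden units to the single output unit L (document class).
W_OUTPUT = {"S": 0.5, "M": 0.5}

def forward(inputs):
    """Propagate binary inputs through the hidden layer to the output unit.

    Assumes a fully connected network without bias terms.
    Returns the hidden activations and the output activation.
    """
    hidden = {
        unit: sigmoid(sum(weights[term] * inputs[term] for term in weights))
        for unit, weights in W_HIDDEN.items()
    }
    output = sigmoid(sum(W_OUTPUT[unit] * hidden[unit] for unit in hidden))
    return hidden, output

# Example: a document containing "computer" and "programming language" only.
hidden, output = forward({"C": 1, "P": 1, "A": 0, "D": 0})
print(hidden, output)
```

The output is a value strictly between 0 and 1; under the encoding above, values closer to 0 would suggest the computer science class and values closer to 1 the math class.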
[Figure: three-layer network. Input layer: C, P, A, D. Hidden layer: S, M. Output layer: L. Edge weights shown in the figure: 0.4, 0.6, 0.6, 0.4, 0.5, 0.5, 0.5, 0.5, 0.3, 0.7.]
Fig. 9 The neural network for document classification
Note that C, P, A and D denote “computer”, “programming language”, “algorithm” and “derivative” respectively. S and M denote “computer science” and “math” respectively. L denotes “doc class”.
Given the corpus D = {doc1.txt, doc2.txt, doc3.txt, doc4.txt, doc5.txt}, the training data is shown in the following table, in which cell (i, j) indicates the number of times that term j (column j) occurs in document i (row i).
Table 7 Term frequencies of documents

            programming language   computer   algorithm   derivative   class
doc1.txt    5                      3          1           1            computer
doc2.txt    5                      5          40          5            math
doc3.txt    20                     5          20          55           math
doc4.txt    20                     55         5           20           computer
doc5.txt    15                     15         4           0.3          math
doc6.txt    35                     10         45          10           computer
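The training table can be held in a small data structure so that cell (i, j) is easy to look up and the class labels can be turned into the 0/1 values the output unit expects. The sketch below is an illustration only: the column order is assumed to be the one read from the table, the values are copied as they appear there, and the names `TERMS`, `ROWS`, `CLASS_CODE`, `term_frequency` and `encode_labels` are hypothetical.

```python
# Column order as read from the table (assumption).
TERMS = ["programming language", "computer", "algorithm", "derivative"]

# One entry per table row: (document, term frequencies, class label).
ROWS = [
    ("doc1.txt", [5, 3, 1, 1], "computer"),
    ("doc2.txt", [5, 5, 40, 5], "math"),
    ("doc3.txt", [20, 5, 20, 55], "math"),
    ("doc4.txt", [20, 55, 5, 20], "computer"),
    ("doc5.txt", [15, 15, 4, 0.3], "math"),
    ("doc6.txt", [35, 10, 45, 10], "computer"),
]

# Output encoding stated in the text: 0 = computer science, 1 = math.
CLASS_CODE = {"computer": 0, "math": 1}

def term_frequency(doc, term):
    """Return cell (i, j): how often `term` occurs in document `doc`."""
    column = TERMS.index(term)
    for name, frequencies, _ in ROWS:
        if name == doc:
            return frequencies[column]
    raise KeyError(doc)

def encode_labels():
    """Map each document name to the numeric target of the output unit."""
    return {name: CLASS_CODE[label] for name, _, label in ROWS}

print(term_frequency("doc4.txt", "computer"))
print(encode_labels())
```

How the raw frequencies are converted into the binary input values is not shown in this excerpt, so the sketch stops at lookup and label encoding.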