Information Technology Reference
In-Depth Information
conclude that document D belongs to r classes $\{l_{h+1}, l_{h+2}, \ldots, l_{h+r}\}$, where
$\{l_{h+1}, l_{h+2}, \ldots, l_{h+r}\} \subseteq \{l_1, l_2, \ldots, l_k\}$.
Table 2 Classifying document D
Classification Expression      Value
$W_1^{*T} D + b_1^*$           $\geq 1$: $D \in l_1$
                               $\leq -1$: $D \notin l_1$
$W_2^{*T} D + b_2^*$           $\geq 1$: $D \in l_2$
                               $\leq -1$: $D \notin l_2$
$W_k^{*T} D + b_k^*$           $\geq 1$: $D \in l_k$
                               $\leq -1$: $D \notin l_k$
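The rule in Table 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes the usual support-vector convention that a decision value of at least 1 places D in class l_i, and it treats every other value (including values strictly between -1 and 1) as "not assigned". The weight vectors, biases, and document vector below are hypothetical numbers, not values from the text.

```python
def decision_value(weights, document, bias):
    """Compute W^T D + b for one class."""
    return sum(w * d for w, d in zip(weights, document)) + bias


def classify(document, classifiers):
    """Return every class label whose decision value is at least 1.

    `classifiers` maps a label to a (weights, bias) pair; one pair per class.
    """
    accepted = []
    for label, (weights, bias) in classifiers.items():
        if decision_value(weights, document, bias) >= 1:
            accepted.append(label)
    return accepted


# Hypothetical trained parameters for k = 3 classes (illustration only).
classifiers = {
    "l1": ([2.0, 0.0], 0.0),
    "l2": ([0.0, 2.0], 0.5),
    "l3": ([-2.0, -2.0], 0.0),
}
document = [1.0, 0.5]  # hypothetical document vector D

assigned = classify(document, classifiers)
print(assigned)
```

With these made-up numbers the document is assigned to two of the three classes, which mirrors the case where D belongs to r of the k classes.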
3 Document Classification Based on Decision Tree
Suppose we are given a set of classes C = { computer science, math }, a set of terms T =
{ computer, programming language, algorithm, derivative }, and the corpus D =
{ doc1.txt, doc2.txt, doc3.txt, doc4.txt, doc5.txt, doc6.txt }. The training data is shown in
the following table, in which cell ( i, j ) gives the number of times that term j
(column j ) occurs in document i (row i ).
Table 3 Term frequencies of documents
            computer   programming language   algorithm   derivative   class
doc1.txt    5          3                      1           1            computer
doc2.txt    5          5                      40          50           math
doc3.txt    20         5                      20          55           math
doc4.txt    20         55                     5           20           computer
doc5.txt    15         15                     40          30           math
doc6.txt    35         10                     45          10           computer
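A row of Table 3 could be produced by counting how often each term of T occurs in a document. The sketch below is one possible way to do that, not the method used in the text: it assumes case-insensitive, whole-word matching, and the sample sentence is invented for illustration.

```python
import re

# The term set T from the text, in column order.
TERMS = ["computer", "programming language", "algorithm", "derivative"]


def term_frequencies(text, terms=TERMS):
    """Count whole-word, case-insensitive occurrences of each term."""
    lowered = text.lower()
    return [
        len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        for term in terms
    ]


# Hypothetical document content (not from the corpus in the text).
sample = (
    "A computer runs an algorithm. "
    "Each programming language lets a computer express an algorithm."
)
row = term_frequencies(sample)
print(row)
```

Whole-word matching means that inflected forms such as "computers" are not counted; a real pipeline would likely add stemming or lemmatization first.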
Table 4 Normalized term frequencies
            computer   programming language   algorithm   derivative   class
doc1.txt    0.5        0.3                    0.1         0.1          computer
doc2.txt    0.05       0.05                   0.4         0.5          math
doc3.txt    0.2        0.05                   0.2         0.55         math
doc4.txt    0.2        0.55                   0.05        0.2          computer
doc5.txt    0.15       0.15                   0.4         0.3          math
doc6.txt    0.35       0.1                    0.45        0.1          computer
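The values in Table 4 appear to be each document's term counts divided by that document's total count, so every row sums to 1. A minimal sketch of that normalization, shown for four of the rows of Table 3 (the function name and the rounding to two decimals are choices made here for illustration):

```python
# Term counts in column order: computer, programming language,
# algorithm, derivative (four rows taken from Table 3).
counts = {
    "doc1.txt": [5, 3, 1, 1],
    "doc3.txt": [20, 5, 20, 55],
    "doc4.txt": [20, 55, 5, 20],
    "doc6.txt": [35, 10, 45, 10],
}


def normalize(row):
    """Divide each count by the row total so the row sums to 1."""
    total = sum(row)
    return [round(count / total, 2) for count in row]


normalized = {doc: normalize(row) for doc, row in counts.items()}
for doc, row in normalized.items():
    print(doc, row)
```

Normalizing by the row total removes the effect of document length, so a short document such as doc1.txt can be compared directly with longer ones.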