$P(c_i)$ is the probability of class $c_i$,
$P(c_i, t)$ is the joint probability of class $c_i$ and the occurrence of term $t$,
$P(t \mid c_i)$ is the probability of $t$ given $c_i$.
We used Boolean, TF-IDF, and lookup table convolution (LTC) to choose the
way the selected features would be weighted and represented numerically to the
CAL. Their mathematical representations are shown in Equations (5), (6), and (7),
respectively, where $w_{jt}$ is the numerical weighting of selected feature $j$ in text $t$:
$$w_{jt} = \begin{cases} 1, & \text{if feature } j \text{ occurs in text } t \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

$$w_{jt} = tf(j, t) \cdot \log \frac{N}{n(j)} \qquad (6)$$

$$w_{jt} = \frac{tf(j, t) \cdot \log \frac{N}{n(j)}}{\sqrt{\sum_{k=1}^{M} \left( tf(k, t) \cdot \log \frac{N}{n(k)} \right)^{2}}} \qquad (7)$$
where:
$j$ is a word in the text $t$,
$N$ is the total number of texts in the dataset,
$M$ is the total number of words in the text,
$tf(j, t)$ is the frequency of the word $j$ in the text $t$, and $n(j)$ is the number of texts that the word $j$ occurs in.
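The three weighting schemes above can be sketched in Python. This is a minimal illustration over a toy tokenized corpus (not the SPA dataset); the function names and sample texts are our own, chosen only to mirror Equations (5)-(7):

```python
import math
from collections import Counter

# Toy corpus: each text is a list of tokens (illustrative only, not the SPA dataset).
texts = [
    ["market", "price", "oil"],
    ["oil", "export", "price", "price"],
    ["match", "goal", "team"],
]
N = len(texts)  # total number of texts in the dataset

# Document frequency n(j): number of texts the word j occurs in.
df = Counter(w for t in texts for w in set(t))

def boolean_weight(word, text):
    # Equation (5): 1 if the word occurs in the text, 0 otherwise.
    return 1 if word in text else 0

def tfidf_weight(word, text):
    # Equation (6): tf(j, t) * log(N / n(j)).
    return text.count(word) * math.log(N / df[word]) if df.get(word) else 0.0

def ltc_weight(word, text):
    # Equation (7): TF-IDF divided by the Euclidean norm over the text's words,
    # so each text's weight vector has unit length (cosine normalization).
    norm = math.sqrt(sum(tfidf_weight(w, text) ** 2 for w in set(text)))
    return tfidf_weight(word, text) / norm if norm else 0.0

print(boolean_weight("oil", texts[0]))  # → 1
```

Note that the Boolean weight ignores how often a word occurs, TF-IDF rewards words that are frequent in a text but rare across texts, and LTC additionally removes the effect of text length.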
An $n \times m$ matrix is then constructed, where $n$ is the number of features and $m$ is the number of texts in the training dataset. Each cell in this matrix holds the weight of feature $j$ in text $t$. The features selected from the training dataset are then extracted from the testing dataset and represented in the same way as they were in the training dataset.
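The matrix construction can be sketched as follows. This is a minimal Python example under our own assumptions: a hypothetical list of selected features and toy tokenized texts, with TF-IDF (Equation 6) as the weighting scheme and document frequencies taken from the training set only:

```python
import math
from collections import Counter

# Hypothetical selected features (n rows) and toy train/test texts (m columns).
features = ["oil", "price", "goal"]
train_texts = [["oil", "price", "price"], ["goal", "team"], ["oil", "export"]]
test_texts = [["price", "oil", "oil"]]

N = len(train_texts)
# Document frequencies are computed on the training dataset only.
df = Counter(w for t in train_texts for w in set(t))

def tfidf(word, text):
    # Equation (6) weighting, using training-set N and n(j).
    return text.count(word) * math.log(N / df[word]) if df.get(word) else 0.0

# n x m matrix: rows are selected features, columns are training texts.
train_matrix = [[tfidf(f, t) for t in train_texts] for f in features]

# Test texts are represented with the SAME features and the SAME weighting,
# so the classifier sees train and test vectors in one feature space.
test_matrix = [[tfidf(f, t) for t in test_texts] for f in features]
```

Words in a test text that were not selected as features are simply ignored, which is what keeps the test representation aligned with the training matrix.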
Table 2 Experimental parameters

Parameter               | Description
------------------------|--------------------------------------------------
Dataset                 | SPA: 6 classes, 1,526 texts
Training size           | 70%
Testing size            | 30%
Preprocessing           | Removing Arabic diacritics, numbers, Latin characters, and stop-list words; normalizing Hamza and Taa Marbutah
FTs                     | Single word, 2-gram, 3-gram, 4-gram
FS methods              | DF, CHI, IG, GSS
No. of terms selected   | High-ranked terms (50, 100, 150, 200)
Threshold               | Minimum DF = 10
FR schemas              | Boolean, TF-IDF, LTC
CALs                    | NB, KNN, SVM
Number of experiments   | 576