$P(c_i)$ is the probability of class $c_i$,
$P(c_i, t)$ is the joint probability of class $c_i$ and the occurrence of term $t$,
$P(t \mid c_i)$ is the probability of $t$ given $c_i$.
We used Boolean, TF-IDF, and lookup table convolution (LTC) to choose the
way the selected features would be weighted and represented numerically to the
CAL. Their mathematical representations are shown in Equations (5), (6), and (7),
respectively, where $w_{jt}$ is the numerical weighting of selected feature $j$ in text $t$:
$$w_{jt} = \begin{cases} 1, & \text{if feature } j \text{ occurs in text } t \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

$$w_{jt} = tf(j, t) \cdot \log \frac{N}{n(j)} \qquad (6)$$

$$w_{jt} = \frac{tf(j, t) \cdot \log \frac{N}{n(j)}}{\sqrt{\sum_{k=1}^{M} \left( tf(k, t) \cdot \log \frac{N}{n(k)} \right)^{2}}} \qquad (7)$$
where:
$j$ is a word in the text $t$,
$N$ is the total number of texts in the dataset,
$M$ is the total number of words in the text,
$tf(j, t)$ is the frequency of the word $j$ in the text $t$, and $n(j)$ is the number of texts that the word $j$ occurs in.
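The three weighting schemes above can be sketched in Python. This is a minimal illustration over a toy tokenized corpus (not the SPA dataset); the function names and sample texts are our own, chosen only to mirror Equations (5)-(7):

```python
import math
from collections import Counter

# Toy corpus: each text is a list of tokens (illustrative only, not the SPA dataset).
texts = [
    ["market", "price", "oil"],
    ["oil", "export", "price", "price"],
    ["match", "goal", "team"],
]
N = len(texts)  # total number of texts in the dataset

# Document frequency n(j): number of texts the word j occurs in.
df = Counter(w for t in texts for w in set(t))

def boolean_weight(word, text):
    # Equation (5): 1 if the word occurs in the text, 0 otherwise.
    return 1 if word in text else 0

def tfidf_weight(word, text):
    # Equation (6): tf(j, t) * log(N / n(j)).
    return text.count(word) * math.log(N / df[word]) if df.get(word) else 0.0

def ltc_weight(word, text):
    # Equation (7): TF-IDF divided by the Euclidean norm over the text's words,
    # so each text's weight vector has unit length (cosine normalization).
    norm = math.sqrt(sum(tfidf_weight(w, text) ** 2 for w in set(text)))
    return tfidf_weight(word, text) / norm if norm else 0.0

print(boolean_weight("oil", texts[0]))  # → 1
```

Note that the Boolean weight ignores how often a word occurs, TF-IDF rewards words that are frequent in a text but rare across texts, and LTC additionally removes the effect of text length.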
An $n \times m$ matrix is then constructed, where $n$ is the number of features and $m$ is the number of texts in the training dataset. Each cell in this matrix holds the weight of feature $j$ in text $t$. The features selected from the training dataset are then extracted from the testing dataset and represented in the same way as they were in the training dataset.
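The matrix construction can be sketched as follows. This is a minimal Python example under our own assumptions: a hypothetical list of selected features and toy tokenized texts, with TF-IDF (Equation 6) as the weighting scheme and document frequencies taken from the training set only:

```python
import math
from collections import Counter

# Hypothetical selected features (n rows) and toy train/test texts (m columns).
features = ["oil", "price", "goal"]
train_texts = [["oil", "price", "price"], ["goal", "team"], ["oil", "export"]]
test_texts = [["price", "oil", "oil"]]

N = len(train_texts)
# Document frequencies are computed on the training dataset only.
df = Counter(w for t in train_texts for w in set(t))

def tfidf(word, text):
    # Equation (6) weighting, using training-set N and n(j).
    return text.count(word) * math.log(N / df[word]) if df.get(word) else 0.0

# n x m matrix: rows are selected features, columns are training texts.
train_matrix = [[tfidf(f, t) for t in train_texts] for f in features]

# Test texts are represented with the SAME features and the SAME weighting,
# so the classifier sees train and test vectors in one feature space.
test_matrix = [[tfidf(f, t) for t in test_texts] for f in features]
```

Words in a test text that were not selected as features are simply ignored, which is what keeps the test representation aligned with the training matrix.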
Table 2 Experimental parameters

Parameter               | Description
------------------------|--------------------------------------------------
Dataset                 | SPA: 6 classes, 1,526 texts
Training size           | 70%
Testing size            | 30%
Preprocessing           | Removing Arabic diacritics, numbers, Latin characters, and stop-list words; normalizing Hamza and Taa Marbutah
FTs                     | Single word, 2-gram, 3-gram, 4-gram
FS methods              | DF, CHI, IG, GSS
No. of terms selected   | High-ranked terms (50, 100, 150, 200)
Threshold               | Minimum DF = 10
FR schemas              | Boolean, TF-IDF, LTC
CALs                    | NB, KNN, SVM
Number of experiments   | 576