Using Word N-Grams as Features in Arabic Text Classification - Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing


To train the classification model on the features selected from the training data, we used three of the most commonly used CALs: KNN, naive Bayes (NB), and support vector machine (SVM). For more information regarding FS methods, feature RSs, and CALs, see [19]. We used the RapidMiner 4.0 [20] implementations of these CALs to train and test the classification model. The experimental parameters are summarized in Table 2.
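
As a rough illustration of the feature pipeline that feeds these CALs, the following Python sketch extracts word n-grams, keeps the terms with the highest document frequency, and builds a Boolean representation. It is not the RapidMiner setup used in the experiments: the function names, the whitespace tokenizer, the English placeholder documents, and the choice of a plain document-frequency filter are all assumptions made for illustration, and Arabic-specific preprocessing is not shown.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text as space-joined strings."""
    tokens = text.split()  # assumption: plain whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def select_by_document_frequency(docs, n, k):
    """Keep the k n-grams that occur in the largest number of documents."""
    df = Counter()
    for doc in docs:
        df.update(set(word_ngrams(doc, n)))  # count each term once per document
    # Highest document frequency first; ties broken alphabetically.
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def boolean_vector(doc, vocabulary, n):
    """Boolean representation: 1 if the term occurs in the document, else 0."""
    present = set(word_ngrams(doc, n))
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical placeholder documents, not taken from the paper's corpus.
docs = [
    "the market rose today",
    "the market fell today",
    "the team won today",
]
vocabulary = select_by_document_frequency(docs, n=2, k=3)
vectors = [boolean_vector(doc, vocabulary, n=2) for doc in docs]
```

The resulting vectors could then be passed to any KNN, NB, or SVM implementation. Changing `n` corresponds to the gram number and `k` to the number of terms.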

3 Results and Discussion

Conducting experiments with the parameters in Table 2, we obtained the data in Table 3.

Table 3 Experimental results

Classifier  Grams  Average  Minimum  (FS, RS, terms)    Maximum  (FS, RS, terms)
NB          1      60.35    39.25    IG, LTC, 150       75       CHI, TFiDF, 200
NB          2      50.38    25.88    CHI, LTC, 200      66.45    IG, Boolean, 200
NB          3      38.76    21.05    CHI, LTC, 200      51.32    DF, Boolean, 200
NB          4      32.77    18.86    CHI, LTC, 200      42.98    GSS, Boolean, 200
KNN         1      49.14    40.13    DF, TFiDF, 100     58.77    CHI, LTC, 50
KNN         2      41.35    28.29    IG, Boolean, 50    51.54    CHI, LTC, 50
KNN         3      37.63    33.33    DF, TFiDF, 50      41.89    IG, LTC, 150
KNN         4      33.54    30.48    IG, Boolean, 50    36.4     GSS, LTC, 150
SVM         1      72.35    67.54    DF, TFiDF, 50      75.44    IG, LTC, 200
SVM         2      58.73    49.78    DF, Boolean, 50    65.13    IG, LTC, 200
SVM         3      41.81    35.53    CHI, Boolean, 50   47.37    IG, TFiDF, 200
SVM         4      35.07    31.58    CHI, Boolean, 50   38.6     GSS, LTC, 100

The table lists the average CA for each gram number of each classifier, the minimum and maximum CA for that classifier and gram number, and the combination of FS method, RS, and number of terms that produced the respective value. The data suggest that, averaged over all gram numbers, the CA of SVM is greater than that of NB, followed by that of KNN. In addition, the data suggest that the greatest CA is achieved when single words are used as features, and that CA declines by 17%, on average, with each increase in the gram number.
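
The two averages quoted above can be checked directly against the "Average" column of Table 3. The short Python sketch below does so. The formula for the decline is an assumption, since the text does not state how the 17% was computed: here it is taken as the mean relative drop in average CA between consecutive gram numbers, over all three classifiers.

```python
# Average CA (%) for gram numbers 1 to 4, copied from Table 3.
average_ca = {
    "NB": [60.35, 50.38, 38.76, 32.77],
    "KNN": [49.14, 41.35, 37.63, 33.54],
    "SVM": [72.35, 58.73, 41.81, 35.07],
}

# Mean CA over all four gram numbers, per classifier.
mean_ca = {name: sum(values) / len(values) for name, values in average_ca.items()}
ranking = sorted(mean_ca, key=mean_ca.get, reverse=True)

# Relative drop between consecutive gram numbers (assumed formula, see above).
drops = [
    (prev - curr) / prev
    for values in average_ca.values()
    for prev, curr in zip(values, values[1:])
]
mean_drop = sum(drops) / len(drops)

print(ranking)
print(f"mean relative decline per gram step: {mean_drop:.1%}")
```

Under this assumption the ranking comes out as SVM, NB, KNN, and the mean decline is roughly 17%, consistent with the figures in the text.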

Notably, while SVM achieved greater CA on average and the best CA using single words, NB achieved a higher maximum CA for 2-grams, 3-grams, and 4-grams. The data show that the best CA was achieved when the number of terms was 200, the maximum number of terms used in our experiments. According to the data, NB worked well with Boolean representation (three of the best results were achieved
