Using Word N-Grams as Features in Arabic Text Classification
Abdulmohsen Al-Thubaity, Muneera Alhoshan, and Itisam Hazzaa
Abstract. The feature type (FT) chosen for extraction from the text and presented
to the classification algorithm (CAL) is one of the factors affecting text
classification (TC) accuracy. Character N-grams, word roots, word stems, and
single words have been used as features for Arabic TC (ATC). A survey of current
literature shows that no prior studies have been conducted on the effect of using
word N-grams (N consecutive words) on ATC accuracy. Consequently, we have
conducted 576 experiments using four FTs (single words, 2-grams, 3-grams, and
4-grams), four feature selection methods (document frequency (DF), chi-squared,
information gain, and the Galavotti-Sebastiani-Simi (GSS) coefficient) with four thresholds for
numbers of features (50, 100, 150, and 200), three data representation schemas
(Boolean, term frequency-inverse document frequency, and lookup table
convolution), and three CALs (naive Bayes (NB), k-nearest neighbor (KNN), and
support vector machine (SVM)). Our results show that the use of single words as a
feature provides greater classification accuracy (CA) for ATC compared to N-
grams. Moreover, CA decreases by 17% on average as N increases. The data also
show that the SVM CAL provides greater CA than NB
and KNN; however, the best CA for 2-grams, 3-grams, and 4-grams is achieved
when the NB CAL is used with Boolean representation and the number of features
is 200.
Keywords: Arabic text classification, feature extraction, classification algorithms,
classification accuracy.
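As a rough illustration of the pipeline described in the abstract, the following scikit-learn sketch extracts word 2-grams, keeps the top features by the chi-squared measure, represents documents as Boolean vectors, and trains a naive Bayes classifier. It is not the authors' implementation; the toy documents, labels, and the feature threshold k are hypothetical placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; the actual experiments use Arabic documents.
docs = [
    "stock market prices rose sharply today",
    "the national team won the football match",
    "central bank cut interest rates again",
    "the striker scored twice in the final match",
]
labels = ["economy", "sports", "economy", "sports"]

pipeline = Pipeline([
    # Word 2-gram features with a Boolean (presence/absence) representation.
    ("ngrams", CountVectorizer(analyzer="word", ngram_range=(2, 2), binary=True)),
    # Chi-squared feature selection; the paper's thresholds are 50, 100, 150, and 200 features.
    ("select", SelectKBest(chi2, k=10)),
    # Naive Bayes classifier over the Boolean feature vectors.
    ("nb", BernoulliNB()),
])

pipeline.fit(docs, labels)
print(pipeline.predict(["interest rates and market prices"]))

Changing ngram_range to (1, 1), (3, 3), or (4, 4) would correspond to the single-word, 3-gram, and 4-gram feature types, and swapping the selector's score function or the final estimator would cover the other feature selection methods and classification algorithms listed above.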