Using Word N-Grams as Features in Arabic Text Classification
Abdulmohsen Al-Thubaity, Muneera Alhoshan, and Itisam Hazzaa
Abstract. The feature type (FT) chosen for extraction from the text and presented
to the classification algorithm (CAL) is one of the factors affecting text
classification (TC) accuracy. Character N-grams, word roots, word stems, and
single words have been used as features for Arabic TC (ATC). A survey of current
literature shows that no prior studies have been conducted on the effect of using
word N-grams (N consecutive words) on ATC accuracy. Consequently, we have
conducted 576 experiments using four FTs (single words, 2-grams, 3-grams, and
4-grams), four feature selection methods (document frequency (DF), chi-squared,
information gain, and the Galavotti-Sebastiani-Simi (GSS) coefficient) with four thresholds for
numbers of features (50, 100, 150, and 200), three data representation schemas
(Boolean, term frequency-inverse document frequency, and lookup table
convolution), and three CALs (naive Bayes (NB), k-nearest neighbor (KNN), and
support vector machine (SVM)). Our results show that the use of single words as a
feature provides greater classification accuracy (CA) for ATC compared to N-
grams. Moreover, CA decreases by 17% on average as N increases. The data also
show that the SVM CAL provides greater CA than NB
and KNN; however, the best CA for 2-grams, 3-grams, and 4-grams is achieved
when the NB CAL is used with Boolean representation and the number of features
is 200.
Keywords: Arabic text classification, feature extraction, classification algorithms,
classification accuracy.
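As a rough illustration of the pipeline described in the abstract, the following scikit-learn sketch extracts word 2-grams, keeps the top features by the chi-squared measure, represents documents as Boolean vectors, and trains a naive Bayes classifier. It is not the authors' implementation; the toy documents, labels, and the feature threshold k are hypothetical placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; the actual experiments use Arabic documents.
docs = [
    "stock market prices rose sharply today",
    "the national team won the football match",
    "central bank cut interest rates again",
    "the striker scored twice in the final match",
]
labels = ["economy", "sports", "economy", "sports"]

pipeline = Pipeline([
    # Word 2-gram features with a Boolean (presence/absence) representation.
    ("ngrams", CountVectorizer(analyzer="word", ngram_range=(2, 2), binary=True)),
    # Chi-squared feature selection; the paper's thresholds are 50, 100, 150, and 200 features.
    ("select", SelectKBest(chi2, k=10)),
    # Naive Bayes classifier over the Boolean feature vectors.
    ("nb", BernoulliNB()),
])

pipeline.fit(docs, labels)
print(pipeline.predict(["interest rates and market prices"]))

Changing ngram_range to (1, 1), (3, 3), or (4, 4) would correspond to the single-word, 3-gram, and 4-gram feature types, and swapping the selector's score function or the final estimator would cover the other feature selection methods and classification algorithms listed above.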