+ word “two topics” (كتابان / كتابين), or a complete sentence “I will write it” (سأكتبه).
Different FS methods can be used to reduce the feature space.
The second type is the stem or root. In this approach, each word in the dataset is
analyzed morphologically to remove affixes, yielding the stem of the word
orthography; the stem is then analyzed further to extract its root. This
approach is useful for reducing the number of features and the sparseness of the data.
Usually no FS method is used with this approach. Reported results show that using
word orthography is more accurate for ATC [6][7][8][9]. The poor results of using
the stem or root as features for ATC are attributed to the low accuracy of the
morphological analyzers used [10].
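The idea of collapsing several surface forms onto one feature can be illustrated with a minimal sketch. This is not a morphological analyzer and not the method of any cited study: it is a simple affix-stripping stand-in, and the affix lists, the function name light_stem, and the minimum-length rule are all assumptions made for illustration.

```python
# Hypothetical, deliberately tiny affix lists (illustration only;
# a real morphological analyzer is far more involved).
PREFIXES = ("ال",)
SUFFIXES = ("ان", "ين", "ون", "ات")


def light_stem(word, min_len=3):
    """Strip at most one listed prefix and one listed suffix,
    never leaving fewer than min_len letters."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word


# Two dual forms of the same noun collapse to a single feature.
forms = ["كتابان", "كتابين"]
stems = {light_stem(w) for w in forms}
print(len(forms), "surface forms ->", len(stems), "stem feature(s)")
```

Because distinct surface forms map to the same string, the feature set shrinks and each remaining feature occurs in more documents, which is the reduction in sparseness described above. Any error in the stripping rules propagates directly into the features.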
The third type is the character N-gram. In this approach, any N consecutive
characters can be considered a feature. This model tries to strip affixes and
reach the root/stem, which is three letters long for most Arabic words, without
any morphological analysis [11][12]. The drawback of this approach is that it
produces a very large number of features, which can affect classification
accuracy (CA). To the best of our knowledge, no comparative analysis has been
conducted with the same dataset and experimental environment to assess the
performance of the three feature types.
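As a small illustration of character N-gram features, the sketch below slides a window of N characters over a word. The function name char_ngrams and the choice N = 3 are assumptions for the example, not details taken from the cited studies.

```python
def char_ngrams(word, n=3):
    """Return every run of n consecutive characters in word.
    Words shorter than n yield an empty list."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# A six-letter word yields four overlapping trigrams.
trigrams = char_ngrams("كتابان", 3)
print(trigrams)
```

A word of length L contributes L - N + 1 overlapping features, which is why the feature space grows so quickly compared with one feature per word.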
The combined use of unigram and bigram word orthography features for ATC was
examined in [13]. The authors compared word orthography unigrams plus bigrams
with word orthography unigrams alone in terms of the CA of the k-nearest
neighbor (KNN) CAL. They used document frequency (DF) for FS, with a threshold
of three, and term frequency-inverse document frequency (TF-IDF) as the RS.
They argued that the combined use of word orthography unigrams and bigrams
provides greater accuracy than using only single words. We cannot fully accept
this argument, because the authors used a subset (four classes) of a dataset of
1,445 texts distributed over nine classes but provided no justification for
selecting those four classes rather than the entire dataset.
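The pipeline described for [13] (word unigrams plus bigrams, DF-based FS, TF-IDF weighting) can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code. It assumes that "a threshold of three" means keeping features that occur in at least three documents, and it uses one common TF-IDF variant (raw term frequency times log(N/DF)). The function names are hypothetical.

```python
import math
from collections import Counter


def word_ngrams(tokens, max_n=2):
    """Word 1-grams up to max_n-grams, each joined by a single space."""
    feats = []
    for n in range(1, max_n + 1):
        feats += [" ".join(tokens[i:i + n])
                  for i in range(len(tokens) - n + 1)]
    return feats


def tfidf_vectors(docs, min_df=3, max_n=2):
    """docs is a list of token lists.
    Returns (vocabulary, list of {feature: weight}), one dict per document."""
    doc_feats = [Counter(word_ngrams(d, max_n)) for d in docs]

    # Document frequency: number of documents containing each feature.
    df = Counter()
    for feats in doc_feats:
        df.update(feats.keys())

    # DF-based feature selection (assumed rule: keep DF >= min_df).
    vocab = sorted(f for f, c in df.items() if c >= min_df)

    n_docs = len(docs)
    vectors = [
        {f: feats[f] * math.log(n_docs / df[f]) for f in vocab if f in feats}
        for feats in doc_feats
    ]
    return vocab, vectors


# Toy corpus of already tokenized documents (placeholder tokens).
docs = [["a", "b"], ["a", "b"], ["a", "b", "c"], ["c", "d"]]
vocab, vectors = tfidf_vectors(docs, min_df=3)
print(vocab)
```

Setting max_n=1 gives the unigram-only baseline, so the two feature sets in the comparison differ by a single parameter.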
Studies of the effect of word-level N-grams on TC for other languages have
reported contradictory results. Single words give greater accuracy for Turkish
TC [14], whereas N-grams produce better results than single terms for Farsi
TC [15]. To the best of our knowledge, no study has compared the accuracy of ATC
using only word-level N-grams as features with that of ATC using word
orthography. That is what we do in this study.
2 Materials and Methods
2.1 Dataset
We used the Saudi Press Agency (SPA) dataset, a part of the King Abdulaziz City
for Science and Technology (KACST) ATC dataset that has been utilized in
several ATC studies [9][16][3][17][18][5]. This dataset consists of 1,526 texts
evenly divided into six news classes: cultural, sports, social, economic, political,
and general. The basic SPA statistics are shown in Table 1.