Using Word N-Grams as Features in Arabic Text Classification - Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing

One reason for the poor CA results when N-grams were used as features was
data sparseness. This arises from two facts: the same Arabic word can occur in
different forms, and using word N-grams increases the number of features while
decreasing the frequency of each feature.
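The sparseness effect described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names `word_ngrams` and `sparsity_profile` and the toy English corpus are assumptions made for illustration, and the study's preprocessing and feature selection steps are omitted. The sketch only counts distinct word N-grams and their mean frequency for increasing N.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the word n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sparsity_profile(docs, n):
    """Return (number of distinct n-grams, mean frequency per n-gram)."""
    counts = Counter()
    for doc in docs:
        # Whitespace tokenization is a simplification (assumption).
        counts.update(word_ngrams(doc.split(), n))
    distinct = len(counts)
    mean_freq = sum(counts.values()) / distinct if distinct else 0.0
    return distinct, mean_freq


# Hypothetical toy corpus, in English for readability.
docs = [
    "the bank raised the interest rate",
    "the bank cut the interest rate",
    "the team won the final match",
]

for n in (1, 2, 3):
    distinct, mean_freq = sparsity_profile(docs, n)
    print(f"{n}-grams: {distinct} distinct, mean frequency {mean_freq:.2f}")
```

On this toy corpus the mean frequency per feature falls as N grows, which is the pattern the paragraph attributes to word N-grams. Arabic morphological variation, the other cause named above, would fragment the counts further and is not modeled here.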

4 Conclusion

In this study, we conducted experiments to investigate the effect of word N-grams
as features for ATC. We used the SPA dataset, four FS methods, different
numbers of top-ranked features, three feature RSs, and three CALs, as summarized
in Table 2. The data, summarized in Table 3, show that single words as features
for ATC produced better results than other word N-grams. The best average CA
was achieved using the SVM CAL; however, for 2-grams, 3-grams, and 4-grams, the
NB CAL provided better CA results than SVM and KNN using the same word N-grams.

The main conclusion of our study is that using single words for ATC is more
effective than using N-grams. Further investigation is required in future work
to validate our results using larger datasets from different domains and genres,
such as newspapers and scientific texts, to reduce data sparseness.
