1 Introduction
1.1 Background
Several initiatives, such as the King Abdullah Initiative for Arabic Content, have
been launched in the past few years to support the growth and quality of Arabic
content on the Internet. Several studies have reported the rapid growth of Arabic
content on the web, as well as in the systems of government institutions and private
enterprises [1]. This proliferation of Arabic content requires techniques and tools
that are capable of organizing and handling that content in intelligent ways. Text
classification (TC)—the process of automatically assigning a text to one or more
predefined classes—is one of many techniques that can be used to organize and
maximize the benefits of existing Arabic content.
Researchers have focused considerable effort on English TC. Applying existing
techniques that were proven to be suitable for English TC may seem to be a
simple option for Arabic TC (ATC). However, it has been shown that what is
suitable for English TC is not necessarily suitable for Arabic [2].
In general terms, implementing a TC system requires several consecutive steps:
collecting a representative text sample, dividing the sample into training and
testing sets, extracting the features, selecting the representative features,
representing the selected features for the classification algorithm (CAL), applying
the algorithm, producing the classification model, applying the model to the
testing data, and evaluating the performance of the classification model. The
techniques used in each of the above steps affect the accuracy of the TC system in
various ways. In this study, we investigate an effect that has not been studied
previously in ATC. Specifically, we study the effect of feature type (FT) on ATC
accuracy. Instead of single words, word N-grams (N consecutive words) are
employed as features using four feature-selection methods, three representation
schemas (RSs), and three CALs.
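To make the notion of a word N-gram concrete, the following minimal Python sketch extracts N consecutive words from a text. It is an illustration only, not the implementation used in this study: the function names word_ngrams and ngram_counts are hypothetical, and whitespace tokenization is an assumption made for the example.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word N-grams (N consecutive words) found in text.

    Assumption: tokens are separated by whitespace. A real ATC system
    would likely apply its own tokenization and normalization first.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    tokens = text.split()
    # Slide a window of n tokens across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count how often each word N-gram occurs in text."""
    return Counter(word_ngrams(text, n))


sample = "the growth of Arabic content on the web"
unigrams = word_ngrams(sample, 1)  # single words, the usual feature type
bigrams = word_ngrams(sample, 2)   # pairs of consecutive words
```

With N = 1 the function reduces to the single-word features commonly used in ATC, so the same routine can produce both feature types for comparison. A text with fewer than N words yields no N-grams.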
In the rest of this section, we summarize the FTs previously used in ATC and
the reported results of using the word N-gram as a feature for TC in other
languages. The dataset used in this study is described in Section 2.1, and
the feature selection (FS) methods, feature RSs, CALs, and other experimental
parameters are presented in Section 2.2. In Section 3, the results and our
interpretation are discussed. In Section 4, we outline our conclusions and future
work.
1.2 Related Work
When planning a TC implementation, the FTs required for text representation
must be considered. In ATC, three FTs are primarily used, the first and most
common being word orthography (see, for example, [3][4][5]). In this approach,
any sequence of Arabic letters bounded by two spaces is considered a feature. In
the Arabic writing system, the features produced with this method can be, for
example, a single word such as “book” (كتاب), the + word “the book” (الكتاب), two