4.2 Dimensionality Reduction
When dealing with textual information, the feature space tends to be large, usually on the order of several thousand attributes (words), so a method to reduce the number of attributes is required. According to [20], attributes that appear in most of the files are not useful for separating the documents, because all classes have instances that contain those attributes. In addition, since we are working with only two classes (spam and legitimate), words that appear rarely in the files carry little weight in identifying the class. Therefore, attributes that appear in fewer than 5% or in more than 95% of the documents in the corpus were removed. After this step, the dimension of the feature vectors is 751. In some cases, dimensionality reduction also improves prediction accuracy [21].
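The document-frequency filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `filter_by_document_frequency`, the pre-tokenized input format, and the toy corpus are assumptions made for the example, and the thresholds are treated as inclusive bounds.

```python
from collections import Counter


def filter_by_document_frequency(documents, min_df=0.05, max_df=0.95):
    """Return the sorted list of words whose document frequency
    (fraction of documents containing the word) lies in [min_df, max_df].

    `documents` is a list of token lists (hypothetical input format).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each word at most once per document.
        doc_freq.update(set(tokens))
    return sorted(
        word
        for word, count in doc_freq.items()
        if min_df <= count / n_docs <= max_df
    )


# Toy corpus (illustrative only); thresholds raised so the effect is visible.
corpus = [
    ["the", "free", "offer"],
    ["the", "meeting", "agenda"],
    ["the", "free", "meeting"],
    ["the", "project", "report"],
]
vocabulary = filter_by_document_frequency(corpus, min_df=0.3, max_df=0.95)
print(vocabulary)
```

In this toy corpus, "the" occurs in every document and is dropped as too frequent, while words that occur in a single document are dropped as too rare.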
5 Performance Measures
Once a classifier has been generated, some measures are needed to quantify its performance and to facilitate comparison with other classifiers. In pattern recognition and information retrieval, when there are multiple categories, performance measures such as recall and precision are used. Although spam detection is a binary classification task, these measures are used here to estimate the accuracy of the methods.
We adopt the same notation used in [4,22]: L and S represent legitimate and spam messages, respectively, and $n_{L \to S}$ (legitimate to spam, or false positive) and $n_{S \to L}$ (spam to legitimate, or false negative) denote the two error types. Correspondingly, $n_{S \to S}$ and $n_{L \to L}$ count the correctly classified spam and legitimate messages. Spam recall (SR) and spam precision (SP) are then defined in equations 4 and 5.
$$SR = \frac{n_{S \to S}}{n_{S \to S} + n_{S \to L}} \tag{4}$$

$$SP = \frac{n_{S \to S}}{n_{S \to S} + n_{L \to S}} \tag{5}$$
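As a small worked example of equations 4 and 5, the sketch below counts the four outcomes from lists of true and predicted labels and then computes spam recall and spam precision. The label encoding ("S" for spam, "L" for legitimate), the function names, and the sample labels are assumptions made for illustration.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count outcomes keyed by true label + predicted label,
    e.g. "LS" is a legitimate message classified as spam."""
    counts = {"SS": 0, "SL": 0, "LS": 0, "LL": 0}
    for true, predicted in zip(true_labels, predicted_labels):
        counts[true + predicted] += 1
    return counts


def spam_recall(counts):
    # Equation 4: fraction of spam messages that the filter caught.
    return counts["SS"] / (counts["SS"] + counts["SL"])


def spam_precision(counts):
    # Equation 5: fraction of messages flagged as spam that really are spam.
    return counts["SS"] / (counts["SS"] + counts["LS"])


# Illustrative labels only, not data from the paper.
true_labels = ["S", "S", "S", "S", "L", "L", "L", "L"]
predicted_labels = ["S", "S", "S", "L", "S", "L", "L", "L"]

counts = confusion_counts(true_labels, predicted_labels)
sr = spam_recall(counts)
sp = spam_precision(counts)
print(counts, sr, sp)
```

Here one spam message slips through and one legitimate message is blocked, so both measures come out at 3/4.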
In anti-spam filtering, misclassifying a legitimate message as spam is worse than letting a spam message pass the filter. If a spam message gets through the filter, the only inconvenience it causes is the time wasted removing it from the inbox. However, if an important legitimate message is misclassified, the consequences can be serious. When the two error types (false positive and false negative) have different costs, the usual precision and recall measures cannot adequately express performance, and cost-sensitive evaluation measures are needed.
Androutsopoulos et al. [4] introduced a weighted accuracy measure (WAcc) that assigns a higher cost to false positives than to false negatives and that has been used in some spam filtering benchmarks [4,8,22]. WAcc and the corresponding weighted error rate (WErr) are defined as:
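$$WAcc = \frac{\lambda \cdot n_{L \to L} + n_{S \to S}}{\lambda \cdot N_L + N_S}, \qquad WErr = \frac{\lambda \cdot n_{L \to S} + n_{S \to L}}{\lambda \cdot N_L + N_S} \tag{6}$$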
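Equation 6 can be illustrated with the following sketch. It assumes that N_L and N_S are the total numbers of legitimate and spam messages (so N_L = n_LL + n_LS and N_S = n_SS + n_SL), which the excerpt does not state explicitly. The function name, the counts, and the value of λ are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def weighted_measures(n_ll, n_ls, n_ss, n_sl, lam):
    """Return (WAcc, WErr) as in equation 6.

    Assumption: N_L and N_S are the totals of legitimate and spam
    messages, derived here from the four outcome counts.
    """
    total_legit = n_ll + n_ls  # N_L
    total_spam = n_ss + n_sl   # N_S
    denominator = lam * total_legit + total_spam
    w_acc = (lam * n_ll + n_ss) / denominator
    w_err = (lam * n_ls + n_sl) / denominator
    return w_acc, w_err


# Illustrative counts and weight, not results from the paper.
w_acc, w_err = weighted_measures(n_ll=95, n_ls=5, n_ss=80, n_sl=20, lam=9)
print(w_acc, w_err)
```

With these numbers, the 5 false positives contribute 45 to the weighted error numerator while the 20 false negatives contribute only 20, and WAcc + WErr = 1 by construction.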