each e-mail has to be classified as spam or legitimate. Finally, machine-learning
models are generated based upon the labelled data.
This process is common in text categorisation. Although text classification
mostly uses the content of the documents and external sources to build
accurate document classifiers, there is a great effort in the scientific community
[9,10,11] directed towards exploiting the link structure among documents to
improve the performance of document classification.
The connections found among documents range from the most common citation
graphs, such as papers citing other papers or websites linking to other
websites, to links constructed from relationships such as co-authorship,
co-citation, appearance at the same conference venue, and others. The
combination of these connections yields an interlinked collection of text
documents.
In some cases, it is interesting to determine the topic not just of a single
document but of a whole collection of unlabelled documents. Collective
classification optimises the labelling of all the documents jointly, taking
into account the connections present among them. It is a semi-supervised
technique, i.e., it uses both labelled and unlabelled data (typically a small
amount of labelled data and a large amount of unlabelled data), which reduces
the labelling work.
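As a rough illustration of the semi-supervised idea described above, the following sketch propagates labels from a few labelled e-mails to their unlabelled neighbours by repeated majority voting. It is a simplified stand-in, not the method evaluated in this paper: the function name propagate_labels, the toy link graph and the message identifiers are all hypothetical, and how two e-mails come to be linked is left as an assumption.

```python
from collections import Counter


def propagate_labels(neighbours, known, max_iters=10):
    """Label every unlabelled node by repeated neighbour voting.

    neighbours: dict mapping each document id to the set of linked ids
    known:      dict mapping the labelled document ids to their labels
    """
    labels = dict(known)  # current guesses; known labels are never changed
    for _ in range(max_iters):
        changed = False
        for doc in sorted(neighbours):
            if doc in known:
                continue
            # Count the labels currently assigned to this node's neighbours.
            votes = Counter(labels[n] for n in neighbours[doc] if n in labels)
            if not votes:
                continue  # no labelled neighbour yet; retry in a later round
            # Majority label; ties are broken alphabetically for determinism.
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if labels.get(doc) != best:
                labels[doc] = best
                changed = True
        if not changed:
            break  # labelling is stable
    return labels


# Hypothetical toy graph: six e-mails, only two of them labelled.
links = {
    "m1": {"m2", "m3"},
    "m2": {"m1", "m3"},
    "m3": {"m1", "m2"},
    "m4": {"m5", "m6"},
    "m5": {"m4", "m6"},
    "m6": {"m4", "m5"},
}
seed = {"m1": "spam", "m4": "legitimate"}
result = propagate_labels(links, seed)
```

With only two labelled messages, the remaining four inherit the label of the cluster they are linked to, which is the sense in which the unlabelled data and the link structure reduce the labelling work.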
Given this background, we propose the first spam filtering system that uses
collective classification to optimise the classification performance. Through this
approach, we minimise the need for labelled e-mails without significantly
penalising detection accuracy.
In summary, our main findings are the following: (i) we describe how to
adopt collective classification for spam filtering, (ii) we try to determine
the optimal size of the labelled dataset for collective-classification-based spam
filtering, and (iii) we show that this approach can reduce the effort of labelling
e-mails while maintaining a high accuracy rate.
The remainder of this paper is organised as follows. Section 2 describes the
process of using collective classification applied to the spam filtering problem.
Section 3 details the experiments performed and presents the results. Finally,
Section 4 concludes and outlines avenues for future work.
2 Collective Classification for Spam Filtering
Collective classification is a combinatorial optimisation problem in which we
are given a set of documents, or nodes, $D = \{d_1, \ldots, d_n\}$ and a
neighbourhood function $N$, where $N_i \subseteq D \setminus \{d_i\}$, which
describes the underlying network structure [12]. $D$ is a random collection of
documents divided into two sets, $X$ and $Y$, where $X$ corresponds to the
documents for which we know the correct values and $Y$ contains the documents
whose values need to be determined. Therefore, the task is to label the nodes
$Y_i \in Y$ with one of a small number of labels, $L = \{l_1, \ldots, l_q\}$.
Since the spam problem can be tackled as a text classification problem, we
use the Waikato Environment for Knowledge Analysis (WEKA) [13] and its
 