[1], neural networks [18, 13], Bayesian methods [12, 24], or support vector
machines [21, 7, 22].
The term frequency method is an effective approach if only a rough
classification of documents based on their subjects or themes is needed.
However, if one would like to further determine the similarity of writing
patterns or determine the authorship of documents, the traditional term
frequency method will provide only very rough estimates with little accuracy
or reliability. The main drawback of the term frequency method is that it
relies on a bag-of-words [6, 10, 20] approach, which assumes feature
independence and disregards any dependencies that may exist between words in
the text. The bag-of-words model may therefore not be the best technique to
capture keyword importance. If the text structure information could be
preserved at the same time, it would lead to a better keyword weighting
scheme [5].
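The limitation can be illustrated with a minimal sketch. The two sample strings and the helper names (term_frequencies, ordered_pairs) are hypothetical and chosen only for illustration; they are not taken from the paper. Two texts with the same words in a different order receive identical term-frequency representations, whereas counting ordered adjacent word pairs tells them apart.

```python
from collections import Counter

def term_frequencies(text):
    """Bag-of-words view: map each word to its count, ignoring word order."""
    return Counter(text.lower().split())

def ordered_pairs(text):
    """Count ordered pairs of adjacent words, which keeps some ordering information."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

# Hypothetical sample texts: same words, different order.
doc_a = "transfer the money to the bank"
doc_b = "the bank to the money transfer"

same_bag = term_frequencies(doc_a) == term_frequencies(doc_b)
same_pairs = ordered_pairs(doc_a) == ordered_pairs(doc_b)
print(same_bag, same_pairs)
```

Here same_bag is true while same_pairs is false: the word counts cannot distinguish the two texts, but even the simplest ordering information can.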
In this paper, we introduce a new approach that exploits not only the
frequency of keywords but also their location and ordering. We represent a
document as a weighted directed multigraph by taking keywords as the vertices
and constructing arcs whose weights contain the relation information of
a keyword to other keywords. The adjacency matrix of the graph induces a
signature vector for the document. A clustering method is then applied to the
set of signature vectors for grouping similar documents into clusters. With
this new approach, we are able to evaluate the similarity between any two
documents from a set of text documents within the same category.
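A rough sketch of this pipeline is given below. This introduction does not specify the arc weighting, so the sketch makes its own assumptions: an arc from keyword u to a later keyword v is given weight 1/gap, where gap is the distance in words, only pairs at most `window` words apart are linked, and parallel arcs are summed into a single adjacency-matrix entry. The keyword list, the `window` parameter, and the use of cosine similarity as the comparison measure are likewise illustrative choices, not the authors' method.

```python
import math

def signature_vector(text, keywords, window=5):
    """Flattened adjacency matrix of a keyword graph (assumed weighting: 1/gap)."""
    index = {k: i for i, k in enumerate(keywords)}
    n = len(keywords)
    adj = [[0.0] * n for _ in range(n)]
    # Keyword occurrences in reading order, as (word position, keyword index).
    hits = [(pos, index[w])
            for pos, w in enumerate(text.lower().split()) if w in index]
    for a, (pos_u, u) in enumerate(hits):
        for pos_v, v in hits[a + 1:]:
            gap = pos_v - pos_u
            if gap > window:
                break  # later occurrences are even farther away
            # Parallel arcs of the multigraph are summed into one entry.
            adj[u][v] += 1.0 / gap
    return [w for row in adj for w in row]

def cosine(x, y):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

# Hypothetical keywords and texts, for illustration only.
keywords = ["money", "bank", "transfer"]
doc_a = "transfer the money to the bank"
doc_b = "the bank to the money transfer"

sig_a = signature_vector(doc_a, keywords)
sig_b = signature_vector(doc_b, keywords)
print(cosine(sig_a, sig_a), cosine(sig_a, sig_b))
```

Although both texts contain each keyword exactly once, their signature vectors differ because the arcs point in opposite directions. Any standard clustering method could then be run on such vectors; which one the paper uses is not stated in this section.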
A set of detailed algorithms for the estimation of signature vectors and
for clustering is presented in this paper. These algorithms have been
applied to two sets of sample documents:
1. Nigerian fraud emails, each of which has the same topic: to transfer money
into some bank accounts in order to receive a larger sum in return.
2. Papers in academic journals in graph theory, some of which are known to
be plagiarized.
Each set belongs to a single category, and therefore keywords may appear
with similar frequencies across its documents. The traditional method of
sorting documents by keyword frequency is able to filter such a set out of a
larger collection of documents with many different subjects. However, by
considering the ordering and location of keywords, we are able to further
evaluate the similarity of documents within their own set, i.e., to group
fraudulent emails authored by the same person or copy-pasted with slight
modifications, or to identify the plagiarized papers.
In the next section, we describe the scheme for representing a document as
a weighted directed multigraph. Section 3 discusses the computational
complexity. In Section 4, we present some application examples of our
algorithm. Finally, Section 5 presents the conclusion and outlines future
research problems.