Graph Model for Pattern Recognition in Text
Qin Wu, Eddie Fuller, and Cun-Quan Zhang
Abstract. In this paper, we propose a novel approach that uses a weighted
directed multigraph for text pattern recognition. Instead of the traditional
model, which classifies text by keyword frequency alone, we set up a weighted
directed multigraph model in which the distances between keywords serve as
the weights of the arcs. We then develop a keyword-frequency-distance-based
algorithm that uses not only the frequency of keywords but also their
ordering. We apply this idea to the detection of plagiarized papers and to
the detection of fraudulent emails written by the same person. The results
of these case studies show that the new method performs much better than
traditional methods.
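To make the graph model concrete, the following is a minimal sketch of one way such a multigraph could be built. It is not the construction defined in this paper: the function name build_keyword_multigraph, the max_distance window, and the choice to measure distance in tokens between keyword occurrences are all assumptions made for illustration. Each keyword is a vertex, and every ordered pair of keyword occurrences within the window contributes one arc whose weight is the distance between them, so repeated co-occurrences yield parallel arcs.

```python
from collections import defaultdict


def build_keyword_multigraph(tokens, keywords, max_distance=None):
    """Build a weighted directed multigraph over keywords (illustrative only).

    Assumption: an arc u -> v is added whenever keyword u occurs before
    keyword v in the token sequence, weighted by the number of token
    positions between them. Parallel arcs are stored as a list of weights.
    """
    # Keep only the positions at which a keyword occurs, in text order.
    positions = [(i, t) for i, t in enumerate(tokens) if t in keywords]

    arcs = defaultdict(list)  # (u, v) -> list of arc weights (distances)
    for a, (i, u) in enumerate(positions):
        for j, v in positions[a + 1:]:
            distance = j - i
            # Later occurrences are even farther away, so stop early.
            if max_distance is not None and distance > max_distance:
                break
            arcs[(u, v)].append(distance)
    return dict(arcs)


if __name__ == "__main__":
    tokens = "graph model graph model".split()
    arcs = build_keyword_multigraph(tokens, {"graph", "model"}, max_distance=3)
    # The number of parallel arcs reflects frequency; the weights and the
    # arc direction reflect distance and ordering.
    print(arcs[("graph", "model")])  # [1, 3, 1]
```

In this sketch, the length of each weight list carries the frequency information that a traditional model would use, while the arc directions and weights add the ordering and distance information that the abstract describes.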
1 Introduction
Determining the similarity of documents in large text archives has been a
very active area of research in recent years. With the advent and ubiquity
of internet communication, finding related documents plays an important role
in applications such as search, fraud detection, and the detection of
conspiring groups. Term frequency has long been used as a tool for
estimating the probabilistic distribution of features in a document.
Applications built on it include language modeling [15], feature selection
[25, 19], and term weighting [8, 16]. Based on term frequency information,
documents can be classified by several clustering methods such as decision trees
 