Graph Model for Pattern Recognition in Text - Mining and Analyzing Social Networks

Graph Model for Pattern Recognition in Text

Qin Wu, Eddie Fuller, and Cun-Quan Zhang

Abstract. In this paper, we propose a novel approach that uses a weighted directed multigraph for text pattern recognition. Instead of the traditional model, which is based on the frequency of keywords for text classification, we set up a weighted directed multigraph model using the distances between the keywords as the weights of arcs. We then develop a keyword-frequency-distance-based algorithm that utilizes not only the frequency information of keywords but also their ordering information. We apply this new idea to the detection of plagiarized papers and the detection of fraudulent emails written by the same person. The results on these case studies show that this new method performs much better than traditional methods.
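
To make the model concrete, the following is a minimal sketch of one way such a weighted directed multigraph could be built from a token sequence. It is an illustration under stated assumptions, not the authors' construction: the function name build_keyword_multigraph, the max_distance window, the choice to direct arcs in text order, and the choice to skip arcs between two occurrences of the same keyword are all hypothetical.

```python
from collections import defaultdict


def build_keyword_multigraph(tokens, keywords, max_distance=5):
    """Map each ordered keyword pair (u, v) to a list of arc weights.

    Each list entry is one parallel arc; its weight is the number of
    token positions between an occurrence of u and a later occurrence
    of v. The window size and arc direction are assumptions made for
    this sketch only.
    """
    # Keyword occurrences in text order, as (position, keyword) pairs.
    occurrences = [(i, tok) for i, tok in enumerate(tokens) if tok in keywords]

    arcs = defaultdict(list)
    for idx, (pos_u, u) in enumerate(occurrences):
        for pos_v, v in occurrences[idx + 1:]:
            distance = pos_v - pos_u
            if distance > max_distance:
                break  # later occurrences are even farther away
            if u != v:  # assumption: no self-loops
                arcs[(u, v)].append(distance)
    return dict(arcs)


text = "graph model uses graph distance between keywords in a graph"
tokens = text.split()
arcs = build_keyword_multigraph(tokens, {"graph", "distance", "keywords"})

for (u, v), weights in sorted(arcs.items()):
    print(f"{u} -> {v}: {weights}")
```

In this toy sentence the pair (graph, distance) receives two parallel arcs, which is what distinguishes a multigraph from a simple weighted digraph: the number of arcs reflects how often the keywords co-occur, while the weights record how far apart and in which order they appear.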

1 Introduction

For text archives containing a large number of documents, determining the similarity of documents is an area of research that has seen a great deal of activity in recent years. With the advent and ubiquity of internet communication, the search for related documents plays an important role in such applications as search, detection of fraud, and the detection of conspiring groups. Term frequency has long been used as a tool for estimating the probabilistic distribution of features in a document. A number of applications have been developed, including language modeling [15], feature selection [25, 19], and term weighting [8, 16]. Based on the term frequency information, documents can be classified by several clustering methods such as decision trees
