Graph Model for Pattern Recognition in Text - Mining and Analyzing Social Networks


[1], neural networks [18, 13], Bayesian methods [12, 24], or support vector machines [21, 7, 22].

The term frequency method is an effective approach for a rough classification of documents based on their subjects or themes. However, if one would like to further determine the similarity of writing patterns or the authorship of documents, the traditional term frequency method provides only very rough estimates with little accuracy or reliability. The main drawback of the term frequency method is that it relies on a bag-of-words [6, 10, 20] approach, which implies feature independence and disregards any dependencies that may exist between words in the text. The bag-of-words model may therefore not be the best technique to capture keyword importance. If the text structure information could be preserved at the same time, it would lead to a better keyword weighting scheme [5].
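
The order-insensitivity of the bag-of-words model can be illustrated with a minimal sketch. The two sentences and the helper `term_frequencies` are hypothetical examples chosen for illustration; they are not taken from the data sets discussed in this paper.

```python
from collections import Counter


def term_frequencies(text):
    """Return a bag-of-words term-frequency table (hypothetical helper)."""
    return Counter(text.lower().split())


# Two made-up sentences that use exactly the same words in a different order.
a = "the bank will transfer the money to the account"
b = "the account will transfer the bank to the money"

# The term-frequency tables are identical although the sentences differ,
# so a pure term-frequency method cannot tell these two texts apart.
same_counts = term_frequencies(a) == term_frequencies(b)
```

Here `same_counts` is true even though `a` and `b` are different strings, which is the loss of ordering information described above.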

In this paper, we introduce a new approach that exploits not only keyword frequencies but also keyword locations and ordering. We represent a document as a weighted directed multigraph by taking keywords as the vertices and constructing arcs whose weighting contains the relation information of a keyword to other keywords. The adjacency matrix of the graph induces a signature vector for the document. A clustering method is then applied to the set of signature vectors for grouping similar documents into clusters. With this new approach, we are able to evaluate the similarity between any two documents from a set of text documents within the SAME category.
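
The idea of a graph-induced signature can be sketched as follows. This is only an illustration under stated assumptions and not the weighting scheme defined in the next section: the arc weight `1 / distance`, the `window` parameter, the collapsing of parallel arcs into a summed weight, and the use of cosine similarity are all assumptions made here, and every function name is hypothetical.

```python
import math


def keyword_graph(tokens, keywords, window=5):
    """Build an adjacency matrix over the keywords (assumed weighting).

    For every ordered pair of keyword occurrences at most `window` tokens
    apart, an arc from the earlier to the later keyword receives weight
    1 / distance. Parallel arcs are collapsed by summing their weights.
    """
    index = {k: i for i, k in enumerate(keywords)}
    n = len(keywords)
    adj = [[0.0] * n for _ in range(n)]
    positions = [(p, index[t]) for p, t in enumerate(tokens) if t in index]
    for a, (pos_a, i) in enumerate(positions):
        for pos_b, j in positions[a + 1:]:
            distance = pos_b - pos_a
            if distance > window:
                break  # later occurrences are even farther away
            adj[i][j] += 1.0 / distance
    return adj


def signature(adj):
    """Flatten the adjacency matrix row by row into a signature vector."""
    return [w for row in adj for w in row]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


keywords = ["bank", "money", "account"]
doc1 = "transfer money to the bank account".split()
doc2 = "the bank account will transfer money".split()

sig1 = signature(keyword_graph(doc1, keywords))
sig2 = signature(keyword_graph(doc2, keywords))
similarity = cosine(sig1, sig2)
```

Both made-up documents contain each keyword exactly once, so their keyword frequencies coincide, yet `sig1` and `sig2` differ because the keywords occur in a different order and at different distances.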

A set of detailed algorithms for the estimation of signature vectors and for clustering is presented in this paper. These algorithms have been applied to two sets of sample documents:

1. Nigerian fraud emails, each of which has the same topic: transferring money into some bank accounts in order to receive a larger sum in return.

2. Papers in academic journals in graph theory, some of which are known to be plagiarized.

Each set belongs to a single category, and therefore keywords may appear with similar frequencies across its documents. The traditional method of sorting documents by keyword frequency is able to filter such a group out of a larger set of documents with many different subjects. However, by considering the ordering and location of keywords, we are able to further evaluate the similarity of documents within their own group, i.e., to classify fraudulent emails authored by the same person or copy-pasted with slight modification, or to identify the plagiarized papers.
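
Grouping documents within one category by the similarity of their signature vectors could look like the sketch below. The clustering method used in this paper is not specified in this section, so the threshold-based single-link grouping shown here is an assumption made purely for illustration. The vectors, the `threshold` value, and the function names are hypothetical as well.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_by_similarity(vectors, threshold=0.9):
    """Single-link grouping: merge documents whose similarity >= threshold.

    Returns a sorted list of clusters, each a list of document indices.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the cluster representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up signature vectors: documents 0 and 1 are near-duplicates,
# document 2 has a different structure.
vectors = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.5], [0.0, 1.0, 0.0]]
groups = group_by_similarity(vectors)
```

With these vectors, documents 0 and 1 end up in one cluster and document 2 in its own. Near-duplicate emails or a plagiarized paper and its source would be expected to show up as such a shared cluster.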

In the next section, we describe the schema for representing a document as a weighted directed multigraph. Section 3 discusses the computational complexity. In Section 4, we present some application examples of our algorithm. Finally, in Section 5, the conclusion is presented and future research problems are outlined.
