Graph Model for Pattern Recognition in Text - Mining and Analyzing Social Networks


of them are far less than 0.2. This strongly indicates that the KFP method is highly effective at detecting a plagiarised paper.

However, if we use the KF method, the similarity between the plagiarised paper and the original paper is 0.97074 (see Table 5). We also find that 6 other pairs of papers have similarities greater than that between the plagiarised paper and the original paper. For example, the similarity between Paper-25 and Paper-34 is above 0.99 (note that the similarity between these two papers under the KFP method is 0.35). From Table 5, we can see that the KFP method performs better than the KF method.

5 Conclusions and Future Work

In this paper, we introduced a weighted directed multigraph to model a text document. This method considers not only the keyword frequency information, but also the structural information in the form of the relations between keywords in documents. Experiments performed on a set of emails and a set of research papers on graph theory show that the weighted directed multigraph model performs significantly better than the commonly used frequency-only model.
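The difference between a frequency-only representation and one that also records relations between keywords can be illustrated with a minimal sketch. It is not the KF or KFP measure defined in this paper: it assumes a single weighted directed edge per ordered keyword pair (rather than a full multigraph), defines an edge as "keyword a is immediately followed by keyword b once non-keywords are dropped", and compares documents by cosine similarity. The function names and the toy documents are hypothetical.

```python
from collections import Counter
from math import sqrt


def keyword_frequencies(tokens, keywords):
    """Frequency-only view: how often each keyword occurs."""
    return Counter(t for t in tokens if t in keywords)


def directed_edge_weights(tokens, keywords):
    """Structural view (assumed here): weight of edge (a, b) is the number
    of times keyword a is directly followed by keyword b after all
    non-keyword tokens have been dropped."""
    kept = [t for t in tokens if t in keywords]
    return Counter(zip(kept, kept[1:]))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents: same keywords, opposite order.
keywords = {"graph", "vertex", "edge", "colouring"}
doc_a = "graph vertex edge colouring".split()
doc_b = "colouring edge vertex graph".split()

kf_similarity = cosine(keyword_frequencies(doc_a, keywords),
                       keyword_frequencies(doc_b, keywords))
edge_similarity = cosine(directed_edge_weights(doc_a, keywords),
                         directed_edge_weights(doc_b, keywords))

print(kf_similarity, edge_similarity)
```

In this toy case the two documents are indistinguishable by keyword counts alone, while their directed edge sets share nothing, which is the kind of separation the structural information is meant to provide.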

We performed experiments on two sets of documents. For the set of graph theory publications, publicly accessible knowledge about identified plagiarised papers provides us with a meaningful “yardstick” for measuring the accuracy and effectiveness of our novel method. We may summarize our result with the following conclusion: the KFP method is able to single out the plagiarised pair with the highest similarity, which is much larger than that of any other pair of papers, while the KF method produces many results without any meaningful gap in similarity to distinguish positive from negative results.

We also tried a weighted undirected multigraph model (i.e., neglecting the direction from one keyword to the other in the graph). Although it loses some structural information about the document, the result is very similar to what we described above. The advantage of the undirected version is a significant reduction in memory usage compared with the weighted directed multigraph model.
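The memory saving from dropping direction can be sketched as follows. This is only an illustration under assumptions: edges are stored as a dictionary from ordered keyword pairs to weights, and the two directions of a pair are merged by summing their weights, a choice the text above does not specify. The function name and the sample edges are hypothetical.

```python
from collections import Counter


def to_undirected(directed_weights):
    """Merge edges (a, b) and (b, a) into one undirected edge.

    The merged edge is keyed by the sorted pair, and its weight is the
    sum of both directions (an assumption made for this sketch).
    """
    undirected = Counter()
    for (source, target), weight in directed_weights.items():
        undirected[tuple(sorted((source, target)))] += weight
    return undirected


# Hypothetical directed edge weights between keywords.
directed = {
    ("graph", "vertex"): 2,
    ("vertex", "graph"): 1,
    ("vertex", "edge"): 1,
}

undirected = to_undirected(directed)
print(len(directed), len(undirected))
```

Every keyword pair that is linked in both directions collapses into a single stored entry, so the undirected structure never holds more edges than the directed one and holds at most half as many when all links are bidirectional.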

These initial results indicate that the algorithm is much more effective than the frequency-only approach at discriminating and clustering text documents, and further improvement in accuracy and performance is expected. Specifically, it is anticipated that one can construct an ontological representation of the semantic information [9, 17, 2] to further enhance the KFP measure, and that this information can then be used to set up the directed weighted multigraph. This will in turn allow us to use the QCM method to classify all documents with even better precision.

Representing a document as a weighted directed multigraph is the novel idea introduced in this paper. This approach enables us to further distinguish documents from the same category into smaller groups based on
