of them are far less than 0.2. This strongly indicates that the KFP method
works perfectly for the detection of a plagiarised paper.
However, if we use the KF method, the similarity between the plagiarised
paper and the original paper is 0.97074 (see Table 5). We also find that
6 other pairs of papers have similarities greater than the similarity between
the plagiarised paper and the original paper. For example, the similarity
between Paper-25 and Paper-34 is above 0.99 (note that the similarity between
these two papers under the KFP method is 0.35). From Table 5, we can see that
the KFP method performs better than the KF method.
5 Conclusions and Future Work
In this paper, we introduced a weighted directed multigraph to model a text
document. This method considers not only the keyword frequency information,
but also the structure information in the form of the relations between
keywords in documents. Experiments performed on a set of emails and a set of
research papers on graph theory show that the weighted directed multigraph
model performs significantly better than the commonly used frequency-only
model.
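As a rough illustration of the distinction drawn above, the following sketch contrasts a frequency-only representation with one that also records which keyword follows which. The edge definition (adjacent keywords in reading order, with repeated edges folded into a count), the cosine comparison, the function names, and the toy documents are all assumptions made for illustration. They are not the graph construction or the KF/KFP similarity measures used in the experiments.

```python
from collections import Counter
from math import sqrt


def keyword_frequencies(keywords):
    """Frequency-only view: how often each keyword occurs."""
    return Counter(keywords)


def directed_edges(keywords):
    """Structural view (assumed): edge (a, b) is weighted by how often
    keyword b immediately follows keyword a."""
    return Counter(zip(keywords, keywords[1:]))


def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


# Two hypothetical keyword sequences with identical counts but different order.
doc_a = ["graph", "vertex", "edge", "graph", "colouring"]
doc_b = ["colouring", "graph", "edge", "vertex", "graph"]

freq_sim = cosine(keyword_frequencies(doc_a), keyword_frequencies(doc_b))
edge_sim = cosine(directed_edges(doc_a), directed_edges(doc_b))

print(f"frequency-only similarity: {freq_sim:.3f}")
print(f"directed-edge similarity:  {edge_sim:.3f}")
```

In this toy case the two documents have the same keyword counts, so the frequency-only comparison cannot separate them, while the edge-based comparison can because they share no directed keyword pair.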
We performed experiments on two sets of documents. For the set of graph
theory publications, publicly accessible knowledge about identified
plagiarised papers provides us with a meaningful “yardstick” for measuring
the accuracy and effectiveness of our novel method. We may summarize our
result with the following conclusion: the KFP method is able to single out
the plagiarised pair with the highest similarity, which is much larger than
that of any other pair of papers, while the KF method produces many results
without any meaningful gap in similarity to distinguish positive from
negative results.
We also tried a weighted undirected multigraph model (i.e., neglecting the
direction from one keyword to the other keyword in the graph). Although it
loses some structure information of the document, the result is very
similar to what we described above. The advantage of the undirected version
is a significant reduction in memory usage compared with the
weighted directed multigraph model.
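One way to picture the memory saving is the sketch below: if directed edges are stored as ordered keyword pairs, ignoring direction merges the entries for (a, b) and (b, a) into a single entry. The storage layout, the choice to sum the two weights, and the example weights are hypothetical. The paper does not describe its data structure at this level of detail.

```python
from collections import Counter


def to_undirected(directed):
    """Collapse directed edge weights into undirected ones.

    Assumption: the weight of an undirected edge is the sum of the
    weights of the two opposite directed edges it replaces.
    """
    undirected = Counter()
    for (a, b), weight in directed.items():
        undirected[frozenset((a, b))] += weight
    return undirected


# Hypothetical directed edge weights between three keywords.
directed = Counter({
    ("graph", "vertex"): 3,
    ("vertex", "graph"): 2,
    ("graph", "edge"): 1,
    ("edge", "graph"): 4,
    ("vertex", "edge"): 1,
})

undirected = to_undirected(directed)

print(len(directed), "directed entries ->", len(undirected), "undirected entries")
```

The total edge weight is preserved, but the information about which keyword precedes which is lost, which is the trade-off noted above.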
These initial results indicate that the algorithm is much more effective
than the frequency-only approach at discriminating and clustering text
documents, and further improvements in accuracy and performance are
expected. Specifically, it is anticipated that one can construct an
ontological representation of the semantic information [9, 17, 2] to further
enhance the KFP measure, and that this information can then be used to set
up the directed weighted multigraph. This will in turn allow us to use the
QCM method to classify all documents with even better precision.
Representing a document as a weighted directed multigraph is the
novel idea introduced in this paper. This approach enables us to further
distinguish documents from the SAME category into smaller groups based on