Automatically Evaluating the Quality of Contents Created by Open Collaborative Knowledge Building - Collaborative Technologies and Applications for Interactive Information Design


Figure 3. ROC curves of the training data set (the upper curve) and the testing data set (the lower curve)

method is to use a machine learning approach, that is, to extract the predictive variables and estimate their parameters from one set of data and then apply the result to another set of data. We randomly divided the wiki pages into two data sets, each consisting of half of the high quality pages and half of the ordinary wiki pages. We then used one set (the training data set) to estimate the parameters of the logistic regression equation, and applied the result to predict the quality of the wiki pages in the second set (the testing data set). In the following section, we report the result of this application.
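The split-and-predict procedure above, together with the detection and false alarm rates plotted in Figure 3, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`stratified_halves`, `fit_logistic`, `odds`, `roc_points`) are hypothetical, each page is assumed to be a `(features, label)` pair with label 1 for high quality and 0 for ordinary, and plain logistic regression fitted by gradient ascent stands in for the stepwise procedure (no variable selection is performed).

```python
import math
import random


def stratified_halves(pages, seed=0):
    """Randomly split (features, label) pairs into two sets, each holding
    half of the high quality pages (label 1) and half of the ordinary ones (label 0)."""
    rng = random.Random(seed)
    first, second = [], []
    for label in (1, 0):
        group = [page for page in pages if page[1] == label]
        rng.shuffle(group)
        mid = len(group) // 2
        first.extend(group[:mid])
        second.extend(group[mid:])
    return first, second


def linear_predictor(w, x):
    # w[0] is the intercept; w[1:] are the coefficients of the predictive variables.
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def fit_logistic(data, lr=0.1, epochs=500):
    """Estimate logistic regression parameters by batch gradient ascent
    on the log-likelihood (assumption: no stepwise variable selection)."""
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-linear_predictor(w, x)))
            grad[0] += y - p
            for j, xj in enumerate(x, start=1):
                grad[j] += (y - p) * xj
        w = [wj + lr * gj / len(data) for wj, gj in zip(w, grad)]
    return w


def odds(w, x):
    """Predicted odds p / (1 - p) of being a high quality page."""
    return math.exp(linear_predictor(w, x))


def roc_points(scored):
    """Turn (odds, label) pairs into (false alarm rate, detection rate) points,
    one per cutoff, by sorting pages from highest to lowest odds.
    Assumes both classes are present; ties in odds are not merged in this sketch."""
    n_high = sum(1 for _, y in scored if y == 1)
    n_ordinary = len(scored) - n_high
    detections = false_alarms = 0
    points = [(0.0, 0.0)]
    for _, y in sorted(scored, key=lambda pair: pair[0], reverse=True):
        if y == 1:
            detections += 1
        else:
            false_alarms += 1
        points.append((false_alarms / n_ordinary, detections / n_high))
    return points
```

With these helpers, `train, test = stratified_halves(pages)` followed by `w = fit_logistic(train)` estimates the parameters on one half, and `roc_points([(odds(w, x), y) for x, y in test])` gives the points of a testing curve; passing `train` instead gives the corresponding training curve.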

Receiver Operating Characteristic (ROC) Curve

The ROC (receiver operating characteristic) curve is a tool from signal detection theory (Egan 1975, Swets 1996). It is a graphical plot of the performance of a binary classifier system as its discrimination threshold is varied. It shows the detection rate as a function of the false alarm rate.

If we consider classifying high quality pages correctly as “detection” and incorrectly classifying ordinary pages as high quality pages as “false alarm”, we can plot a ROC curve by sorting the wiki pages in the testing data set by their odds of being a high quality page. The odds were calculated using the stepwise logistic regression equation constructed from the training data set.

In Figure 3, the upper curve is from the training data set and the lower curve is from the testing data set. Every point along the curves represents a possible cutoff point to discriminate between high quality pages and ordinary pages. The detection rate associated with a point is the ratio of the number of high quality pages that would be correctly classified using that point as a cutoff to the total number of high quality pages in the testing data set. The false alarm rate associated with a point is the ratio of the number of ordinary pages that would be incorrectly classified as high quality pages using that point as a cutoff to the total number of ordinary pages in the testing data set.

As we can see from the graph, the predictive power shown by the ROC curves is good. When detection rate of the training data set is as high as
