Table 5. Confusion matrix of the exploratory study using linear discriminant analysis

Actual Group          No. of Cases    Predicted Group Membership
                                      High Quality Pages    Ordinary Pages
High Quality Pages    58              54 (93.1%)            4 (6.9%)
Ordinary Pages        85              13 (15.3%)            72 (84.7%)

Percent of “grouped” cases correctly classified: 88.1%
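
The percentages in Table 5 follow directly from the raw counts. As a minimal sketch (the dictionary layout and the function names `row_percentages` and `overall_accuracy` are illustrative assumptions, not part of the original study), the per-group rates and the overall rate can be recomputed like this:

```python
# Counts copied from Table 5: outer keys are the actual group,
# inner keys are the predicted group.
confusion = {
    "High Quality Pages": {"High Quality Pages": 54, "Ordinary Pages": 4},
    "Ordinary Pages": {"High Quality Pages": 13, "Ordinary Pages": 72},
}


def row_percentages(matrix):
    """For each actual group, the percentage of its cases assigned to each predicted group."""
    result = {}
    for actual, predicted in matrix.items():
        total = sum(predicted.values())
        result[actual] = {
            label: 100.0 * count / total for label, count in predicted.items()
        }
    return result


def overall_accuracy(matrix):
    """Percentage of all cases whose predicted group matches the actual group."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(predicted.values()) for predicted in matrix.values())
    return 100.0 * correct / total


rates = row_percentages(confusion)
accuracy = overall_accuracy(confusion)

for actual, predicted in rates.items():
    for label, pct in predicted.items():
        print(f"actual={actual!r} predicted={label!r}: {pct:.1f}%")
print(f"overall: {accuracy:.1f}%")
```

The overall figure is the diagonal of the matrix (54 + 72 = 126 correctly classified pages) divided by all 143 cases, which rounds to the 88.1% reported under the table.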
4% decrease in the correct classification rate. The remaining variables were:

Sum of request: The sum of all kinds of requests (e.g., “peer review request”, “clean up request”, “page needing attention” request, etc.) placed by registered users either to the author or to the others.

Interaction frequency: The number of times a page received a revision or modification by a contributor who was not the contributor of the immediately previous edition of the same page.

Active-member involved: The number of active members who contributed to a page (an active contributor is defined as a registered member who contributes more than 4 times to the wiki, where 4 is the median of the number of contributions per contributor).

Average length of paragraph: Number of words divided by number of paragraphs.

Use frequency of active talk pages: The number of “article talk pages” used by contributors.

Personal names and Organizational name: The frequencies with which personal names and organizational names appear in a page.

Logistic Regression

There are some inherent limitations associated with discriminant analysis. For example, it assumes a multivariate Gaussian distribution of the predictive variables, and that may not be valid for the array of variables based on textual characteristics and collaborative frequencies. Therefore we supplemented discriminant analysis with logistic regression.

We constructed the following equation:

$$\frac{p}{1 - p} = e^{y}$$

where $p$ is the probability for a wiki page to have high quality, and $y$ is a linear combination of the predictor variables: $y = \alpha_0 + \sum_i \alpha_i v_i$. The ratio between $p$ and $1 - p$ should be greater than one for high quality pages, and less than one for ordinary wiki pages. The coefficients $\alpha_i$ were chosen to maximize the correct prediction rate (Menard, 1995).

Applying the same step-wise selection algorithm mentioned in the previous section to the logistic regression, we retrieved a slightly different set of good predictive variables: the Average length of paragraph (in number of words) was dropped, and the logistic regression instead picked up Length of leading paragraph (in number of words). The correct classification rate is slightly better than that of the discriminant analysis: 90.1% of the wiki pages were correctly classified.

We continued our investigation using some other, non-parametric approaches, including the decision tree method, locally weighted regression, and the support vector machine method. They all had similar performance, and none beat the performance of logistic regression.

The best way to understand the power of the