Table 4. Categories and examples of features of wiki pages used in data analysis

Feature category: Examples
Punctuation: Number of periods, question marks, exclamation marks, commas, semicolons, colons, dashes, ellipses, parentheses, brackets, quotation marks, forward slashes, apostrophes, hyphens
Length: Average length of words in characters, sentences in words, paragraphs in words. Length of title, subtitle, leading paragraph, and document
Unique words: Number of unique words; number of unique words excluding stop words
Entities: Number of persons, locations, organizations, and dates
Request activities: Number of times the page has received a request from a registered member (e.g., “peer review” request, “clean up request”, “page needing attention” request)
View activities: Number of times the wiki page has been viewed by users; number of times the wiki page has been viewed by registered members
Page categories: Number of categories a page belongs to
Links: Number of outward http links; number of internal wiki links (links that point to another wiki page in the same wiki) in a wiki page
Collaborative activities: Number of comments and exchanges in the “article talk page”; number of modifications or additions
Intensity of collaboration: Number of active registered members who contributed to the page; number of times a page was revised or modified by a contributor other than the contributor of the immediately preceding edition of the same page; time span from the initial addition of the page to its final modification (measured in hours) divided by the number of modifications
Age: The age of a page measured in number of hours
Part of speech: Number of tokens, proper nouns, personal pronouns, possessive pronouns, determiners, prepositions, verbs in base form, verbs in past tense, verbs in present participle, verbs in past participle, verbs in present tense, verbs in ing form
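As a rough illustration of how a few of the simpler features in Table 4 (punctuation counts, average word length, unique words with and without stop words) might be computed from page text, consider the sketch below. The function name `extract_features`, the tokenizing regular expression, and the tiny stop-word list are all assumptions made for this example; the source does not describe its extraction code.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# a real analysis would use a much fuller list.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def extract_features(text: str) -> dict:
    """Compute a handful of surface features of the kind listed in Table 4."""
    # Lowercased alphabetic tokens; this simple tokenizer is an assumption.
    words = re.findall(r"[a-z']+", text.lower())
    unique = set(words)
    return {
        "periods": text.count("."),
        "commas": text.count(","),
        "question_marks": text.count("?"),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "unique_words": len(unique),
        "unique_words_no_stop": len(unique - STOP_WORDS),
    }


# Made-up sample text, purely for illustration.
features = extract_features("The wiki is a tool. Is the wiki open, free, and useful?")
```

Features such as entities or part-of-speech counts would need a language-processing toolkit and are left out of this sketch.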
where High denotes high quality pages, Low
denotes ordinary pages, and N is the number of
pages.
The result of the discriminant analysis is summarized
in Table 5 as a confusion matrix (Kohavi
and Provost, 1998), which records the actual
and predicted class memberships assigned
by a classification system. In our case,
the classification system assigned the wiki pages
to two classes: high quality pages and ordinary
pages.
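A two-class confusion matrix of this kind can be tallied as in the following minimal sketch. The function name `confusion_matrix` and the six example labels are invented for illustration and are not the study's data; only the class names High and Low come from the text.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels=("High", "Low")):
    """Return counts[a][p]: number of pages of actual class a predicted as class p."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}


# Made-up labels for six pages, purely for illustration.
actual = ["High", "High", "High", "Low", "Low", "Low"]
predicted = ["High", "High", "Low", "Low", "Low", "High"]

matrix = confusion_matrix(actual, predicted)

# Overall accuracy is the share of pages on the diagonal of the matrix.
accuracy = sum(matrix[c][c] for c in matrix) / len(actual)
```

Rows correspond to actual classes and columns to predicted classes, so the off-diagonal cells count misclassified pages.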
As indicated in Table 4, many
predictive variables were used in the analysis. Some of
them could be redundant, in the sense that their
contribution to the overall predictive power could
be ignored. To eliminate redundant variables and
thereby save administrative work as well as computing
resources, we used a stepwise discriminant
analysis algorithm (Huberty, 1994) to reduce the
number of predictive variables. Mathematically
speaking, the stepwise method cannot give a better
result, but it can decrease the number of predictive
variables dramatically.
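As a rough illustration of how a stepwise procedure alternates between entering and removing variables, consider the sketch below. It is not the authors' implementation: the source does not say which selection criterion or thresholds were used (stepwise discriminant analysis typically relies on a statistic such as Wilks' lambda), so the criterion here is a caller-supplied `score` function, and the names `stepwise_select`, `enter_threshold`, `remove_threshold`, and the toy `info` table are all hypothetical.

```python
def stepwise_select(variables, score, enter_threshold, remove_threshold, max_steps=100):
    """Stepwise selection with a caller-supplied criterion (larger score is better).

    A variable enters if it raises the score by at least enter_threshold;
    a selected variable is removed if dropping it lowers the score by less
    than remove_threshold. Keep remove_threshold <= enter_threshold so that
    a variable is not removed right after it enters.
    """
    selected = []
    for _ in range(max_steps):
        changed = False
        # Entry step: try the remaining variable with the largest criterion value.
        candidates = [v for v in variables if v not in selected]
        if candidates:
            base = score(selected)
            best = max(candidates, key=lambda v: score(selected + [v]))
            if score(selected + [best]) - base >= enter_threshold:
                selected.append(best)
                changed = True
        # Removal step: drop the weakest variable until none meets the removal criterion.
        while len(selected) > 1:
            full = score(selected)
            losses = {v: full - score([u for u in selected if u != v]) for v in selected}
            weakest = min(selected, key=losses.get)
            if losses[weakest] >= remove_threshold:
                break
            selected.remove(weakest)
            changed = True
        if not changed:
            break  # no variable met the entry or removal criteria
    return selected


# Toy criterion (an assumption): each variable "covers" some abstract units of
# information, and a subset's score is the number of distinct units covered.
info = {
    "word_count": {1, 2, 3, 4},
    "unique_words": {1, 2, 5},
    "sentence_length": {3, 4, 6},
}


def coverage(subset):
    return len(set().union(*(info[v] for v in subset)))


selected = stepwise_select(list(info), coverage, enter_threshold=1, remove_threshold=1)
```

In this toy run `word_count` enters first but is later removed, because the other two variables together cover everything it contributes; this is the sense in which a redundant variable adds nothing to the overall predictive power.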
In our stepwise approach, the first variable
included is the one with the largest value of the selection
criterion. The value of the criterion is then re-evaluated
for all variables not in the model, and the
remaining variable with the largest criterion value
is entered next. At this point, the variable that
was entered first is re-evaluated to determine
whether it meets the removal criterion. If it does, it
is removed from the model. Next, all variables not
in the equation are examined for entry, followed
by an examination of the variables in the equation
for removal. Variables are removed until none
remain that meet the removal criterion. Variable
selection terminates when no more variables meet
the entry or removal criteria. Using this approach, we
reduced the number of predictive variables from
more than one hundred to only a few with only
 