3.3 Acceptability Ranker
After processing by the Choice Generation System and the Paraphrase Generation System, most source sentences have been transformed into a variety of statements with different testing purposes and different appearances. Not all of them are needed for the final application. We therefore train a binary classifier to answer the question, "Can this statement be accepted as a choice?" The probability scores provided by the classifier are used to rank the choice candidates according to their acceptability in an assessment.
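The paper does not specify an implementation of this ranking step; the following is a minimal sketch, assuming a scikit-learn-style logistic regression classifier and a hypothetical extract_features() function (neither is part of the original work) to show how classifier probabilities can order the candidates.

```python
# Minimal sketch: rank choice candidates by the classifier's acceptability probability.
# Assumes a trained scikit-learn LogisticRegression and a hypothetical
# extract_features() that maps a candidate statement to its feature vector.
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_candidates(ranker: LogisticRegression, candidates, extract_features):
    """Return (candidate, score) pairs sorted by P(acceptable), highest first."""
    X = np.array([extract_features(c) for c in candidates])
    # Column 1 of predict_proba holds P(acceptable | features).
    scores = ranker.predict_proba(X)[:, 1]
    order = np.argsort(-scores)
    return [(candidates[i], float(scores[i])) for i in order]
```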
The features used by the ranker fall into five groups by function, combining features commonly used in QG with those frequently considered in paraphrase scoring. Surface features describe the appearance of the choice candidate in terms of grammaticality and length. Vagueness features indicate how vague the sentence is; the grammar features of [8] belong to this group, because part-of-speech tags and grammatical structures can suggest how descriptive the sentence is. Transformation rule features capture the inherent accuracy of each transformation rule. Replacement features measure the quality of a replacement by considering the content and context of the replacing and replaced phrases. QG challenging features estimate how challenging the choice candidate might be, summarizing the category and extent of the paraphrasing. There are 90 features in total.
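As a rough illustration of how such grouped features might be assembled for the ranker, the sketch below concatenates the five groups into one vector. The group names follow the paper, but the individual feature functions, dictionary keys, and per-group dimensionalities are placeholders, not the actual 90-feature inventory.

```python
# Illustrative sketch of combining the five feature groups into one vector.
# The per-group extractors below are placeholders for the paper's 90 features.
import numpy as np

FEATURE_GROUPS = {
    "surface":        lambda cand: [cand["length"], cand["grammar_score"]],
    "vagueness":      lambda cand: cand["vagueness_scores"],    # incl. grammar features [8]
    "transformation": lambda cand: cand["rule_indicator"],      # which transformation rule fired
    "replacement":    lambda cand: cand["replacement_scores"],  # content/context of the swap
    "qg_challenge":   lambda cand: cand["paraphrase_extent"],   # category and extent of paraphrasing
}

def extract_features(candidate: dict) -> np.ndarray:
    """Concatenate all feature groups into a single vector for the ranker."""
    parts = [np.atleast_1d(fn(candidate)) for fn in FEATURE_GROUPS.values()]
    return np.concatenate(parts).astype(float)
```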
4 Experiment and Results
4.1 Parameter Estimations
The parameters in Equation 1 are estimated following the settings in [23] and the optimization function in [22], with minor adjustments. The Acceptability Ranker is trained on data that are partly rated by two human experts; the remaining part is rated by workers on the Amazon Mechanical Turk 3 (MTurk) service. The human experts rated independently, and each Turker's ratings were required to correlate with the others' to at least a moderate degree on a per-batch basis. The raters were asked to rate acceptability on a Likert scale whose definition follows [9]: scores from 1 to 5 represent bad, unacceptable, borderline, acceptable, and good, respectively. We binarize the ratings, labeling scores above 3.5 as acceptable and the rest as unacceptable. We also asked the raters to mark each choice as true or false, given the article.
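The binarization step can be stated concretely as follows. The paper gives only the 3.5 threshold; aggregating multiple raters by the mean is an assumption in this sketch.

```python
# Minimal sketch of the rating binarization described above: the mean 1-5
# Likert rating is mapped to 1 (acceptable) if it exceeds 3.5, else 0.
# Aggregation by the mean is an assumption; the paper states only the threshold.
def binarize_rating(ratings):
    """Average the 1-5 Likert ratings and apply the 3.5 acceptability threshold."""
    mean = sum(ratings) / len(ratings)
    return 1 if mean > 3.5 else 0

# Example: ratings of 4, 4 and 3 average to 3.67, so the statement is acceptable.
assert binarize_rating([4, 4, 3]) == 1
assert binarize_rating([3, 3, 4]) == 0
```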
In total, 10 articles with 1,065 related statements generated by our system are annotated. 200 statements are randomly selected as the held-out test set, while the rest form the training set for logistic regression. The Acceptability Ranker trained in this work achieves an accuracy of 0.73 on the test set, as shown in Table 3. Since there is a concern that the Turkers' annotation quality might not be as good as that of the human experts, we also trained the Acceptability Ranker using only the training-set data annotated by the human experts and only those annotated by the Turkers, respectively. The former subset hits a higher
3 https://www.mturk.com/