3.3 Acceptability Ranker
After processing by the Choice Generation System and the Paraphrase Generation System, most source sentences have been transformed into a variety of statements with different testing purposes and different appearances. Not all of them are needed for the final application. We therefore train a binary classifier to answer the question, "Can this statement be accepted as a choice?" The probability scores provided by the classifier are used to rank the choice candidates according to their acceptability in an assessment.
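The paper does not specify an implementation of this ranking step; the following is a minimal sketch, assuming a scikit-learn-style logistic regression classifier and a hypothetical extract_features() function (neither is part of the original work) to show how classifier probabilities can order the candidates.

```python
# Minimal sketch: rank choice candidates by the classifier's acceptability probability.
# Assumes a trained scikit-learn LogisticRegression and a hypothetical
# extract_features() that maps a candidate statement to its feature vector.
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_candidates(ranker: LogisticRegression, candidates, extract_features):
    """Return (candidate, score) pairs sorted by P(acceptable), highest first."""
    X = np.array([extract_features(c) for c in candidates])
    # Column 1 of predict_proba holds P(acceptable | features).
    scores = ranker.predict_proba(X)[:, 1]
    order = np.argsort(-scores)
    return [(candidates[i], float(scores[i])) for i in order]
```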
The features used by the ranker fall into five groups by function, combining features commonly used in QG with those frequently considered in paraphrase scoring. Surface features describe the appearance of the choice candidate in terms of grammaticality and length. Vagueness features indicate how vague the sentence is; the grammar features of [8] belong to this group, because part-of-speech tags and grammatical structures can suggest how descriptive the sentence is. Transformation rule features capture the inherent accuracy of each transformation rule. Replacement features measure the quality of a replacement by considering the content and context of the replacing and replaced phrases. QG challenging features estimate how challenging the choice candidate might be, summarizing the category and extent of the paraphrasing. There are 90 features in total.
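As a rough illustration of how such grouped features might be assembled for the ranker, the sketch below concatenates the five groups into one vector. The group names follow the paper, but the individual feature functions, dictionary keys, and per-group dimensionalities are placeholders, not the actual 90-feature inventory.

```python
# Illustrative sketch of combining the five feature groups into one vector.
# The per-group extractors below are placeholders for the paper's 90 features.
import numpy as np

FEATURE_GROUPS = {
    "surface":        lambda cand: [cand["length"], cand["grammar_score"]],
    "vagueness":      lambda cand: cand["vagueness_scores"],    # incl. grammar features [8]
    "transformation": lambda cand: cand["rule_indicator"],      # which transformation rule fired
    "replacement":    lambda cand: cand["replacement_scores"],  # content/context of the swap
    "qg_challenge":   lambda cand: cand["paraphrase_extent"],   # category and extent of paraphrasing
}

def extract_features(candidate: dict) -> np.ndarray:
    """Concatenate all feature groups into a single vector for the ranker."""
    parts = [np.atleast_1d(fn(candidate)) for fn in FEATURE_GROUPS.values()]
    return np.concatenate(parts).astype(float)
```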
4 Experiment and Results
4.1 Parameter Estimations
The parameters in Equation 1 are estimated following the settings in [23] and the optimization function in [22], with minor adjustments. The Acceptability Ranker is trained on data that are partly rated by two human experts; the remaining part is rated by workers on the Amazon Mechanical Turk 3 (MTurk) service. The human experts rated independently, and each Turker's ratings were required to correlate with the others' to at least a moderate degree on a per-batch basis. The raters were asked to rate acceptability on a Likert scale whose definition follows [9]: scores from 1 to 5 represent bad, unacceptable, borderline, acceptable, and good, respectively. We binarize the ratings, labeling scores above 3.5 as acceptable and the rest as unacceptable. We also asked the raters to mark each choice as true or false, given the article.
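The binarization step can be stated concretely as follows. The paper gives only the 3.5 threshold; aggregating multiple raters by the mean is an assumption in this sketch.

```python
# Minimal sketch of the rating binarization described above: the mean 1-5
# Likert rating is mapped to 1 (acceptable) if it exceeds 3.5, else 0.
# Aggregation by the mean is an assumption; the paper states only the threshold.
def binarize_rating(ratings):
    """Average the 1-5 Likert ratings and apply the 3.5 acceptability threshold."""
    mean = sum(ratings) / len(ratings)
    return 1 if mean > 3.5 else 0

# Example: ratings of 4, 4 and 3 average to 3.67, so the statement is acceptable.
assert binarize_rating([4, 4, 3]) == 1
assert binarize_rating([3, 3, 4]) == 0
```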
In total, 10 articles with 1,065 related statements generated by our system are annotated. 200 statements are randomly selected as the held-out test set, while the rest form the training set for logistic regression. The Acceptability Ranker trained in this work achieves an accuracy of 0.73 on the test set, as shown in Table 3. Since there is a concern that the Turkers' annotation quality might not be as good as that of the human experts, we also trained the Acceptability Ranker using only the training-set data annotated by the human experts and only those annotated by the Turkers, respectively. The former subset hits a higher
3 https://www.mturk.com/