accuracy of 0.7596, while the subset tagged by Turkers reaches a significantly lower accuracy of 0.6875, suggesting that the work done by Turkers might be less consistent.
Table 3. Accuracy of the Acceptability Ranker

             HE + MTurk    HE        MTurk
Accuracy     0.73          0.7596    0.6875

HE: the data tagged by human experts
MTurk: the data tagged by Turkers
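Accuracy here is the fraction of choice candidates on which the ranker's acceptability decision matches the annotated label. A minimal sketch of how the per-subset figures in Table 3 could be computed (the field names source, predicted, and gold are illustrative assumptions, not the paper's data format):

    from collections import defaultdict

    def accuracy_by_source(examples):
        # Tally correct predictions overall ("ALL") and per annotation source.
        correct, total = defaultdict(int), defaultdict(int)
        for ex in examples:
            for key in ("ALL", ex["source"]):      # source is "HE" or "MTurk"
                total[key] += 1
                correct[key] += int(ex["predicted"] == ex["gold"])
        return {key: correct[key] / total[key] for key in total}

    # Toy input; real data would hold one entry per annotated choice candidate.
    examples = [
        {"source": "HE",    "predicted": 1, "gold": 1},
        {"source": "HE",    "predicted": 0, "gold": 1},
        {"source": "MTurk", "predicted": 1, "gold": 1},
    ]
    print(accuracy_by_source(examples))   # {'ALL': 0.67, 'HE': 0.5, 'MTurk': 1.0}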
4.2 Experimental Settings
To show the overall performance, we evaluate the top-ranked statements from the perspective of question generation. The baseline system is the one proposed by Heilman and Smith [7], which is also intended to facilitate QG and outputs statements. Since the baseline is included in our system as the simplification component, the effect of adding the other components can be shown.
Two articles, one from BBC News (22 sentences) and the other from GSAT English 2009 (15 sentences), are randomly selected. They represent different writing styles: one is a news report and the other is written in a more formal register. Both are processed by the baseline and by our system into factual statements. Two human experts, graduate students who are non-native English speakers with high English proficiency, are each asked to complete half of the rating work. A moderate Pearson correlation coefficient between their ratings is achieved. The evaluation metrics include grammaticality (1-5), make-sense (1-3), challenging score (1-3), and overall quality (1-5).
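The inter-rater agreement mentioned above could be computed as a Pearson correlation over statements scored by both raters. A minimal sketch with made-up overall-quality ratings:

    import math

    def pearson_r(xs, ys):
        # Pearson correlation between two raters' scores on the same items.
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
        sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
        return cov / (sd_x * sd_y)

    # Hypothetical overall-quality scores (1-5) from the two experts:
    rater_a = [5, 4, 3, 4, 2, 5]
    rater_b = [4, 4, 3, 5, 2, 4]
    print(pearson_r(rater_a, rater_b))   # about 0.77 for these toy scores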
For each article, the baseline generates around 20-35 simplified statements, while our system generates over 700 variations. All the statements from the baseline are evaluated. Since these statements cover every source sentence in the input, to make a fair comparison, the top-5 choice candidates that our system generates for each source sentence are selected for evaluation.
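Selecting the top-5 candidates per source sentence is a simple grouping-and-ranking step. A minimal sketch, assuming each candidate is a (sentence_id, statement, score) tuple scored by the Acceptability Ranker (this tuple layout is an assumption for illustration, not the paper's data format):

    from collections import defaultdict

    def top_k_per_sentence(candidates, k=5):
        # Group candidates by their source sentence and keep the k best-scored.
        by_sentence = defaultdict(list)
        for sent_id, statement, score in candidates:
            by_sentence[sent_id].append((score, statement))
        return {
            sent_id: [stmt for _, stmt in sorted(scored, reverse=True)[:k]]
            for sent_id, scored in by_sentence.items()
        }

    # Example: two candidates for sentence 0, one for sentence 1.
    candidates = [(0, "Statement A", 0.91), (0, "Statement B", 0.42),
                  (1, "Statement C", 0.77)]
    print(top_k_per_sentence(candidates, k=1))
    # {0: ['Statement A'], 1: ['Statement C']}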
4.3 Experimental Results
If all transformations proceed without errors, the transformation rules should determine whether a choice is true or false. A contingency table summarizing the intended correctness and the actual correctness is shown in Table 4. The statistics are summed over the training and the testing data for the Acceptability Ranker. In view of the quality of the work on MTurk, as Table 3 suggests, we only take the human-annotated data for this evaluation in order to obtain more reliable results. Excluding the choice candidates that are unacceptable, 83% of the correctness labels remain as planned. For statements that are made to be true, 94% of them are successful. In contrast, for statements that are designed to be distractors, a lower ratio of 75% is attained. True statements are more likely to maintain their correctness