accuracy of 0.7596, while the subset tagged by Turkers reaches a significantly lower accuracy of 0.6875, suggesting that the work done by Turkers might be less consistent.
Table 3. Accuracy of the Acceptability Ranker

             HE + MTurk    HE        MTurk
Accuracy     0.73          0.7596    0.6875

HE: the data tagged by human experts
MTurk: the data tagged by Turkers
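Accuracy here is the fraction of choice candidates on which the ranker's acceptability decision matches the annotated label. A minimal sketch of how the per-subset figures in Table 3 could be computed (the field names source, predicted, and gold are illustrative assumptions, not the paper's data format):

    from collections import defaultdict

    def accuracy_by_source(examples):
        # Tally correct predictions overall ("ALL") and per annotation source.
        correct, total = defaultdict(int), defaultdict(int)
        for ex in examples:
            for key in ("ALL", ex["source"]):      # source is "HE" or "MTurk"
                total[key] += 1
                correct[key] += int(ex["predicted"] == ex["gold"])
        return {key: correct[key] / total[key] for key in total}

    # Toy input; real data would hold one entry per annotated choice candidate.
    examples = [
        {"source": "HE",    "predicted": 1, "gold": 1},
        {"source": "HE",    "predicted": 0, "gold": 1},
        {"source": "MTurk", "predicted": 1, "gold": 1},
    ]
    print(accuracy_by_source(examples))   # {'ALL': 0.67, 'HE': 0.5, 'MTurk': 1.0}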
4.2 Experimental Settings
To show the overall performance, we evaluate the top-ranked statements from the perspective of question generation. The baseline system is the one proposed by Heilman and Smith [7], which is also intended to facilitate QG and outputs statements. Since the baseline is included in our system as the simplification component, the effect of adding the other components can be shown.
Two articles, one from BBC News (22 sentences) and the other from GSAT English 2009 (15 sentences), are randomly selected. They represent different writing styles: one is a news report and the other is written in a more formal register. Both are processed by the baseline and by our system into factual statements. Two human experts, graduate students who are non-native English speakers with high English proficiency, are each asked to complete half of the rating work. A moderate Pearson correlation coefficient between their ratings is achieved. The evaluation metrics include grammaticality (1-5), make-sense (1-3), challenging score (1-3), and overall quality (1-5).
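The inter-rater agreement mentioned above could be computed as a Pearson correlation over statements scored by both raters. A minimal sketch with made-up overall-quality ratings:

    import math

    def pearson_r(xs, ys):
        # Pearson correlation between two raters' scores on the same items.
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
        sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
        return cov / (sd_x * sd_y)

    # Hypothetical overall-quality scores (1-5) from the two experts:
    rater_a = [5, 4, 3, 4, 2, 5]
    rater_b = [4, 4, 3, 5, 2, 4]
    print(pearson_r(rater_a, rater_b))   # about 0.77 for these toy scores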
For each article, the baseline generates around 20-35 simplified statements, while our system generates over 700 variations. All the statements from the baseline are evaluated. Since these statements cover every source sentence in the input, to make a fair comparison, the top-5 choice candidates that our system generates for each source sentence are selected for evaluation.
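Selecting the top-5 candidates per source sentence is a simple grouping-and-ranking step. A minimal sketch, assuming each candidate is a (sentence_id, statement, score) tuple scored by the Acceptability Ranker (this tuple layout is an assumption for illustration, not the paper's data format):

    from collections import defaultdict

    def top_k_per_sentence(candidates, k=5):
        # Group candidates by their source sentence and keep the k best-scored.
        by_sentence = defaultdict(list)
        for sent_id, statement, score in candidates:
            by_sentence[sent_id].append((score, statement))
        return {
            sent_id: [stmt for _, stmt in sorted(scored, reverse=True)[:k]]
            for sent_id, scored in by_sentence.items()
        }

    # Example: two candidates for sentence 0, one for sentence 1.
    candidates = [(0, "Statement A", 0.91), (0, "Statement B", 0.42),
                  (1, "Statement C", 0.77)]
    print(top_k_per_sentence(candidates, k=1))
    # {0: ['Statement A'], 1: ['Statement C']}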
4.3 Experimental Results
If all transformations proceed without errors, the transformation rules should determine whether a choice is true or false. A contingency table summarizing the intended correctness and the actual correctness is shown in Table 4. The statistics are summed over the training and the testing data for the Acceptability Ranker. In view of the quality of the work on MTurk, as Table 3 suggests, we only take the human-annotated data for this evaluation in order to obtain more reliable results. Excluding the choice candidates that are unacceptable, 83% of the correctness labels remain as planned. For statements that are made to be true, 94% of them are successful. In contrast, for statements that are designed to be distractors, a lower ratio of 75% is attained. True statements are more likely to maintain their correctness