Question-Answering for Agricultural Open Data - Transactions on Large-Scale Data- and Knowledge-Centered Systems

must match with ⓦ Cor ⓦ F. We then cluster the identical property values for each property based on the Longest Common Substring (LCS) and add up the PageRank values of the source web pages in each cluster, in order to exclude errors of the extraction and of the information source and to determine the best possible property value and the second-best. Finally, experienced gardeners selected a correct value for each property from the extracted values. If there were various theories as to the correct value for a property, they selected the dominant one.
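The clustering and scoring step described above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function names, the greedy assignment to the first matching cluster, and the `min_ratio` threshold are all assumptions, since the text does not say how the LCS is turned into a clustering decision. The PageRank values in the usage lines are toy numbers.

```python
from difflib import SequenceMatcher


def lcs_length(a: str, b: str) -> int:
    """Return the length of the longest common substring of a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def cluster_values(extractions, min_ratio=0.8):
    """Group (value, pagerank) pairs and sum the PageRank per cluster.

    min_ratio is a hypothetical threshold: a value joins a cluster when the
    LCS covers at least that fraction of the shorter of the two strings.
    """
    clusters = []
    for value, pagerank in extractions:
        for cluster in clusters:
            label = cluster["label"]
            shorter = min(len(value), len(label))
            if shorter and lcs_length(value, label) / shorter >= min_ratio:
                cluster["score"] += pagerank
                cluster["members"].append(value)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append({"label": value, "score": pagerank, "members": [value]})
    # Highest summed PageRank first: entries 0 and 1 are the 1-best and 2-best.
    return sorted(clusters, key=lambda c: c["score"], reverse=True)


# Toy input: (extracted value, PageRank of the source page).
extractions = [
    ("full sun", 0.25),
    ("partial shade", 0.5),
    ("full sun exposure", 0.5),
]
ranked = cluster_values(extractions)
best, second_best = ranked[0], ranked[1]
```

In this toy input, a single page with a high PageRank reports "partial shade", but the two pages that agree on "full sun" outweigh it once their scores are summed, which is the error-filtering effect the text describes.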

LOD Extraction Accuracy. The LOD extraction method was evaluated on the values of 13 properties of 90 plants. Table 1 shows the average precision and recall of the best possible value (1-best), broken down into the whole process, the bootstrapping method, and the dependency parsing. The precision and recall of the second-best possible value (2-best) for the whole process are also shown in the table. Although we retrieved more than 100 web pages for each plant, DOM parse errors and differences in file types reduced the number of usable pages to about 60%. If the sums of the PageRank values of two clusters are the same, both values are regarded as being in the first position. In addition, the accuracy is calculated per cluster rather than per extracted value. In the case of 1-best, the cluster with the largest PageRank value is the answer for the property. In the case of 2-best, the two largest clusters are compared with the correct value, and if either of them is correct, the answer is regarded as correct. N-best precision is defined as follows:

\[
\text{N-best precision} = \frac{1}{|D_q|} \sum_{1 \le k \le N} r_k ,
\]

where $|D_q|$ is the number of correct answers for a query $q$, and $r_k$ is a function equaling 1 if the item at rank $k$ is correct, zero otherwise.
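As a small worked illustration of this definition, the sketch below computes the N-best precision for one query. The function name and the argument layout are assumptions made for the example. The usage lines assume a single correct value per property, so that $|D_q| = 1$, which matches the setting where the gardeners select one correct value.

```python
def n_best_precision(ranked_is_correct, n, num_correct):
    """Compute (1 / |D_q|) * sum of r_k over 1 <= k <= N for one query.

    ranked_is_correct: list of booleans, r_k for k = 1, 2, ... in rank order.
    n: the cutoff N.
    num_correct: |D_q|, the number of correct answers for the query.
    """
    if num_correct <= 0:
        raise ValueError("num_correct (|D_q|) must be positive")
    hits = sum(1 for flag in ranked_is_correct[:n] if flag)
    return hits / num_correct


# Toy case: the top cluster is wrong and the second cluster is correct.
flags = [False, True]
one_best = n_best_precision(flags, n=1, num_correct=1)
two_best = n_best_precision(flags, n=2, num_correct=1)
```

With these flags the 1-best score is 0 and the 2-best score is 1, which mirrors the rule that a 2-best answer counts as correct if either of the two largest clusters is correct.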

The 1-best result achieved a precision of 85% and a recall of 77%, and the 2-best result achieved a precision of 97% and a recall of 87%. We thus confirmed that it is possible to present a binary choice that includes a correct answer in many cases. Automatic extraction will never be perfect, so manual checking is necessary at the final step. Therefore, the binary choice is a realistic design. In more detail, the bootstrapping collected fewer values, and its recall was lower than that of the dependency parsing. However, its precision was higher than that of the dependency parsing. The reason is that data written in tables was extracted correctly but lacked diversity of properties. The dependency parsing collected a large number of values, including much noisy data, and so the total accuracy was affected by the dependency parsing. The reason is that the cluster with the largest PageRank value was composed of the values extracted by the dependency parsing. We thus plan to set some weights on the values extracted by the bootstrapping.
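One way such a weighting could look is sketched below. The text states only the plan, so everything here is hypothetical: the function name, the method labels, and the weight of 2.0 are illustrative choices, not values from the paper. For brevity the sketch groups values by exact string match rather than by LCS.

```python
def weighted_scores(extractions, method_weights):
    """Sum weight * PageRank per value, with the weight set by extraction method.

    extractions: iterable of (value, pagerank, method) triples.
    method_weights: hypothetical mapping from method name to weight;
    methods that are not listed get a weight of 1.0.
    """
    scores = {}
    for value, pagerank, method in extractions:
        weight = method_weights.get(method, 1.0)
        scores[value] = scores.get(value, 0.0) + weight * pagerank
    # Highest weighted score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy input with illustrative PageRank values.
extractions = [
    ("full sun", 0.25, "bootstrapping"),
    ("partial shade", 0.375, "dependency_parsing"),
]
unweighted = weighted_scores(extractions, {})
weighted = weighted_scores(extractions, {"bootstrapping": 2.0})
```

Without weights the noisier dependency-parsing value ranks first here. Doubling the bootstrapping weight moves the table-derived value to the top, which is the kind of shift the planned weighting is meant to produce.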

