must match with Cor F. We then cluster the identical property values for each
property based on the Longest Common Substring (LCS) and add up the PageRank
values of the source web pages in each cluster. This excludes errors of the
extraction and of the information source, and determines the best possible
property value and the second-best. Finally, experienced gardeners selected a
correct value for each property from the extracted values. If there were various
theories as to the correct value for a property, they selected the dominant one.
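The clustering and scoring step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the greedy grouping strategy, and the `min_ratio` threshold are assumptions, since the text only states that clusters are built with LCS and scored by summed PageRank.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Return the length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def cluster_values(extractions, min_ratio=0.8):
    """Group (value, pagerank) pairs and sum the PageRank of each group.

    A value joins the first cluster whose label shares an LCS covering at
    least min_ratio of the shorter string. min_ratio is an assumed threshold.
    Returns the clusters sorted by summed PageRank, largest first.
    """
    clusters = []
    for value, pagerank in extractions:
        for cluster in clusters:
            shorter = min(len(value), len(cluster["label"]))
            lcs = longest_common_substring(value, cluster["label"])
            if shorter and lcs / shorter >= min_ratio:
                cluster["score"] += pagerank
                cluster["members"].append(value)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append({"label": value, "score": pagerank, "members": [value]})
    return sorted(clusters, key=lambda c: c["score"], reverse=True)


# Hypothetical extractions for one property of one plant.
ranked = cluster_values([
    ("full sun", 0.4),
    ("full sun", 0.3),
    ("partial shade", 0.5),
    ("full sunlight", 0.1),
])
best, second_best = ranked[0]["label"], ranked[1]["label"]
```

Here the first two entries of `ranked` play the role of the best and second-best candidate values that are later shown to a human checker.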
LOD Extraction Accuracy. The LOD extraction method was evaluated on the values
of 13 properties for 90 plants. Table 1 shows the average precision and recall of
the best possible value (1-best), reported separately for the whole process, the
bootstrapping method, and the dependency parsing. The precision and recall of the
second-best possible value (2-best) for the whole process are also shown in the table.
Although we retrieved more than 100 web pages for each plant, DOM parse
errors and differences in file types reduced the number of pages to about 60%.
If the sums of the PageRank values of two clusters are the same, both values
are regarded as being in first position. In addition, the accuracy is calculated
per cluster rather than per extracted value. In the case of 1-best, the cluster
with the largest summed PageRank value is the answer for the property. In
the case of 2-best, the two largest clusters are compared with the correct value,
and if either of the answers is correct, the result is regarded as correct. N-best
precision is defined as follows:
$$\text{N-best precision} = \frac{1}{|D_q|} \sum_{1 \le k \le N} r_k,$$

where $|D_q|$ is the number of correct answers for a query $q$, and $r_k$ is a
function equaling 1 if the item at rank $k$ is correct, and zero otherwise.
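As a worked illustration of this definition, the sketch below computes N-best precision for a single query and averages it over several queries. The function names and the sample data are hypothetical. Averaging over queries is an assumption based on the averaged figures reported in Table 1.

```python
def n_best_precision(ranked_items, correct_answers, n):
    """Compute (1 / |D_q|) * sum of r_k over ranks 1..n for one query.

    ranked_items: candidate values ordered from rank 1 downward.
    correct_answers: set of correct values for the query (D_q).
    """
    if not correct_answers:
        raise ValueError("the query needs at least one correct answer")
    # r_k is 1 when the item at rank k is correct, 0 otherwise.
    hits = sum(1 for item in ranked_items[:n] if item in correct_answers)
    return hits / len(correct_answers)


def mean_n_best_precision(queries, n):
    """Average N-best precision over (ranked_items, correct_answers) pairs."""
    return sum(n_best_precision(r, c, n) for r, c in queries) / len(queries)


# Hypothetical queries: one property value to find for each of two plants.
queries = [
    (["full sun", "partial shade"], {"full sun"}),
    (["acidic", "neutral"], {"neutral"}),
]
one_best = mean_n_best_precision(queries, 1)
two_best = mean_n_best_precision(queries, 2)
```

With one correct value per property, `one_best` counts only queries whose top cluster is right, while `two_best` also credits queries where the correct value sits in second place.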
The 1-best result achieved a precision of 85% and a recall of 77%, and the
2-best result achieved a precision of 97% and a recall of 87%. We thus confirmed
that in many cases it is possible to present a binary choice that includes the
correct answer. Automatic extraction will never be perfect, so manual checking
is necessary at the final step. Therefore, the binary choice is a realistic
design. In more detail, the bootstrapping collected fewer values, and its recall
was lower than that of the dependency parsing. However, its precision was
higher than that of the dependency parsing. The reason is that data written in
tables was extracted correctly but covered a limited variety of properties. The
dependency parsing collected a large number of values, including much noisy
data, and the total accuracy was therefore affected by the dependency parsing.
The reason is that the cluster with the largest PageRank value was composed of
the values extracted by the dependency parsing. We thus plan to set some
weights on the values extracted by the bootstrapping.
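One possible reading of this plan is to multiply each page's PageRank by a per-method weight before summing within a cluster. The sketch below is only an assumption about how such weighting could look: the function name, the method labels, and the weight values are hypothetical and do not come from the text.

```python
def weighted_cluster_score(members, method_weights):
    """Sum PageRank values scaled by the weight of the extraction method.

    members: list of (pagerank, method) pairs belonging to one cluster.
    method_weights: hypothetical mapping from method name to weight.
    """
    return sum(pagerank * method_weights[method] for pagerank, method in members)


# Hypothetical weights that favor the more precise bootstrapping method.
weights = {"bootstrapping": 2.0, "dependency_parsing": 1.0}

table_cluster = [(0.3, "bootstrapping")]
noisy_cluster = [(0.2, "dependency_parsing"), (0.2, "dependency_parsing")]

table_score = weighted_cluster_score(table_cluster, weights)
noisy_score = weighted_cluster_score(noisy_cluster, weights)
```

Without weights the noisy cluster would win with a sum of 0.4 against 0.3. With the assumed weights the cluster backed by a table extraction ranks first.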