Fig. 1.7. Evaluation results
The accuracy of the workflow was tested by using randomly selected gene or
protein names from tables of the PDF dataset that was downloaded by using the
terms “gene expression microarray rat” on the PubMed database. The queries
produced 598 hits. A sample of 16 gene or protein names was tested. The queries
for these names produced from 1 to more than 800 hits each. Because of this volume, we tested only the
first 40 hits for each of the 16 queries, so that a total of 506 hits were evaluated.
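The cap also explains why 506 hits were evaluated rather than the maximum of 16 × 40 = 640: a query that returns fewer than 40 hits contributes only what it has. A minimal sketch of this capping rule follows; the function name `cap_hits` and the example hit counts are hypothetical and are not the study's data.

```python
def cap_hits(hits_per_query, cap=40):
    """Return how many hits are evaluated per query, at most `cap` each."""
    return [min(n, cap) for n in hits_per_query]

# Hypothetical hit counts for illustration only (not the study's data).
example_counts = [1, 12, 40, 75, 800]

evaluated = cap_hits(example_counts)
total = sum(evaluated)

# Upper bound on evaluated hits for 16 queries capped at 40 hits each.
max_total = 16 * 40

print(evaluated, total, max_total)
```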
Each hit was analysed in several respects. The first was whether the hit
was a table entry or a text passage, and whether it duplicated another analysed hit.
Errors at this stage usually derive from the table identification
algorithm. Duplicates occurred in approximately 11% of the cases, and 24% of the
hits were text entries (cf. Fig. 1.7).
The next step was to check whether the caption of each table entry was correct. We
found several problems: the caption could be too short (this occurred only
twice) or too long (about 22%). When the caption was too long, usually an additional
word had slipped in. This class of errors derives from problems in the PDF-to-text
conversion. Another problem was the lack of a genuine caption or a wrong
caption for the table (the latter occurred in 8% of the cases). Both problems
most likely come from the caption matching algorithm. All in all, we found that
in 65% of the hits the gene name is in a table, just as intended. We believe that
this is an acceptable precision, given that text entries also tend to give additional
information about the gene.
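The precision figure can be read as the share of evaluated hits labelled as table entries. The sketch below assumes each hit receives exactly one of three labels, which the reported shares (65% + 24% + 11% = 100%) suggest but the text does not state outright. The label names, the function `category_shares`, and the list of 100 labels are hypothetical, chosen only to mirror the reported percentages; they are not the raw evaluation data.

```python
from collections import Counter

def category_shares(labels):
    """Return the fraction of hits that fall into each label."""
    if not labels:
        raise ValueError("at least one labelled hit is required")
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Illustrative labels mirroring the reported shares (assumption, not raw data).
labels = ["table"] * 65 + ["text"] * 24 + ["duplicate"] * 11

shares = category_shares(labels)

# Precision here means: the hit is a table entry, as the workflow intends.
precision = shares["table"]
print(f"precision = {precision:.2f}")
```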
All of the queries identified the table their terms were originally taken from,
which places the recall at 100%. Because the small sample size might distort
that number, we tried a number of other terms and were unable to produce a
failure. We therefore suspect the recall to be very high, although that is hard to verify
because of the frequently high number of hits.
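To illustrate how much a sample of 16 queries can say about recall, the sketch below computes the observed recall and a one-sided lower confidence bound for the case of zero failures. It assumes the queries behave like independent trials with a common success probability, which is a simplification not made in the text; the function names and the 95% level are likewise assumptions chosen for illustration.

```python
def observed_recall(found):
    """Fraction of queries that retrieved the table their term came from."""
    if not found:
        raise ValueError("at least one query result is required")
    return sum(found) / len(found)

def recall_lower_bound_no_failures(n, alpha=0.05):
    """One-sided lower bound on recall when all n trials succeeded.

    Solves p**n = alpha for p, assuming independent trials.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return alpha ** (1.0 / n)

# All 16 sampled queries found their source table, as reported in the text.
found = [True] * 16

recall = observed_recall(found)
lower = recall_lower_bound_no_failures(len(found))
print(f"observed recall = {recall:.2f}, 95% lower bound = {lower:.3f}")
```

Under these assumptions the observed recall is 1.0, while the lower bound stays noticeably below it, which is consistent with the caution expressed above about the small sample.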
The participating biologist was amazed and very interested to see the diversity
of contexts her protein was mentioned in. Most of these she would never have
discovered through a conventional literature review, which considers only papers already
related to the problem she is working on or with the protein name in the abstract
of the paper.