experiments like DNAse I footprints or Electrophoretic Mobility Shift Assays (EMSA).
Unfortunately, the specific experiments conducted are rarely mentioned in the
abstract or title of the documents in question. A search for general keywords
that characterize this broad field, such as “gene regulation”, “promoter” or
“binding site”, returns over 150,000 hits, and even with additional refinement
only 10-20% contain appropriate data. Therefore it is necessary to screen all the
hits manually to obtain literature references suitable for the database
annotation. Of these, the references that contain pictures of the DNAse I
footprint or EMSA are especially valuable, because they represent verified
information of high quality. This quality assessment can be important for
further exploration of the subject.
In this case study, the corpus included 188 papers that were known to contain
information about DNA binding sites (from the PRODORIC database). We
extracted 1430 pictures, about one quarter of them pictures of whole pages.
In data cleansing, we found that 13% of the papers (the oldest ones) were
completely unreadable due to text conversion errors; the images extracted from
them showed scanned pages of the paper. Another 10% showed fairly good text
recognition, but their pictures were not included separately, only as part of a
whole-page picture. Another 8% showed minor errors in some of the captions, such
as captions that were too short or figure blocks that were not recognized due
to text conversion errors. All in all, for 80% of the papers the captions could
be indexed properly. The rest were set aside for manual inspection.
To find DNAse I footprints, the keywords “footprint”, “footprinting” and
“DNAse” were used to find the appropriate figures in CaptionSearch. Overall,
184 hits were scored, of which 163 actually showed experimental data. As a
by-product, the thumbnails presented by the engine mostly sufficed to make a
fast quality assessment. Another positive effect was that the data was available
much faster than with the usual method of opening each PDF independently. The
search for EMSAs was a bit more difficult, since there is a wide range of naming
possibilities. The most significant terms in those names were “shift”,
“mobility”, “EMSA” and “EMS” (to catch “EMS assay”). We had 91 hits, of which
81 were genuine. Recall could not be tested thoroughly, due to the sheer number
of pictures and the limited time of the experts, but a random sample did not
include any interesting pictures that had not also been found by the keywords,
which suggests a rather high recall.
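
The keyword search and the resulting hit rates can be illustrated with a
minimal sketch. It is not the CaptionSearch implementation: the Caption class,
the search_captions and precision functions, and the three sample captions are
hypothetical, and matching keywords as prefixes of caption words is an
assumption made here so that “EMS” catches “EMSA” and “EMS assay” without also
matching words such as “systems”. Only the keyword lists and the hit counts
(163 of 184, 81 of 91) come from the text.

```python
import re
from dataclasses import dataclass


@dataclass
class Caption:
    figure_id: str  # hypothetical identifier of the extracted figure
    text: str       # caption text recovered from the PDF


def search_captions(captions, keywords):
    """Return captions in which any word starts with any keyword.

    Matching is case-insensitive and prefix-based (an assumption of this
    sketch, not a documented property of CaptionSearch).
    """
    prefixes = tuple(k.lower() for k in keywords)
    hits = []
    for caption in captions:
        words = re.findall(r"[a-z0-9]+", caption.text.lower())
        if any(word.startswith(prefixes) for word in words):
            hits.append(caption)
    return hits


def precision(genuine_hits, total_hits):
    """Fraction of returned hits that actually showed the experiment."""
    return genuine_hits / total_hits


# Keyword lists as given in the text.
FOOTPRINT_KEYWORDS = ["footprint", "footprinting", "DNAse"]
EMSA_KEYWORDS = ["shift", "mobility", "EMSA", "EMS"]

# Invented sample captions, for illustration only.
captions = [
    Caption("fig1", "DNAse I footprinting of the promoter region."),
    Caption("fig2", "Gel mobility shift assay with purified protein."),
    Caption("fig3", "Growth curves of the wild type and the mutant."),
]

footprint_hits = search_captions(captions, FOOTPRINT_KEYWORDS)
emsa_hits = search_captions(captions, EMSA_KEYWORDS)

print([c.figure_id for c in footprint_hits])  # ['fig1']
print([c.figure_id for c in emsa_hits])       # ['fig2']

# Precision for the hit counts reported in the text.
print(round(precision(163, 184), 3))  # 0.886 for the footprint search
print(round(precision(81, 91), 3))    # 0.89 for the EMSA search
```

With the reported counts, both searches reach a precision of roughly 89%.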
The second field study was conducted in collaboration with neurologists. They
were interested in finding papers on the topic of “mismatch negativity” and
making them searchable through our engine. Technically, the main difference
between the two corpora is the range of publication years. The binding site
corpus covers the years 1995 to 2003, while the mismatch negativity corpus only
includes papers from the years 2001 and 2002. And of course the general topic
is different, one being from microbiology and one from neurology.
In the newer corpus the problems were a little different. It includes 355 papers,
all from the years 2001 and 2002, containing 1754 extractable pictures. Two of the
papers had to be omitted in the data cleansing step. We found that 31% of
the pictures were logos. Most of these occurred either on the first page, or on the