6.5 Sample Results
The algorithm outlined in Fig. 3 has been implemented as an extension to the
BioRAT tool [5]. As a proof of principle of the methods described here, we
used the implementation to derive a template which could be used to identify
protein-protein interactions from biological literature.
We start with the seed phrase “Rad53p protein binds to Dbf4p”, where
“Rad53p” and “Dbf4p” are proteins, and “binds to” suggests a direct physical
interaction. We create a positive corpus by extracting 500 abstracts that are
listed in the Database of Interacting Proteins (DIP [26]). We assume that most
of these abstracts mention at least one protein-protein interaction, although
the information will be expressed in many different ways. We also create a
neutral corpus by selecting the first 500 abstracts dated September 2005
retrieved from PubMed. This is essentially a random set of biomedical texts. It
may contain some references to particular protein-protein interactions, but
we assume that most of these abstracts will not.
To select the “best template” during the search (Fig. 3, Steps 4a and 5),
we define a quality score thus:
score = (number of positive matches) - w × (number of neutral matches)
The weight w is used to control the trade-off between true- and false-positives
and hence balance the search for high recall and high precision templates. (As
explained in Sect. 6.1, we are assuming that each match in the neutral corpus
is a false-positive result.) We start with a small value of w = 0.5 and slowly
increase this as the search progresses. This encourages early exploration of the
search space while later penalising false positive matches heavily. We select
the unevaluated template with the highest score at each step. Where two or
more templates have the same score, we prefer the one that contains more
gazetteer elements and fewer literal elements than the others. This
deliberately introduces a slight bias favouring templates that
contain references to gazetteers, as these represent useful domain knowledge.
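The scoring and selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the BioRAT implementation: the Candidate class, the function names, and the linear weight schedule (including its step size) are assumptions, since the text says only that w starts at 0.5 and is increased slowly. The tie-break is modelled here as a lexicographic comparison (score, then more gazetteer elements, then fewer literal elements), which is one plausible reading of the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one unevaluated template and its match counts."""
    pattern: str
    positive_matches: int
    neutral_matches: int
    gazetteer_elements: int
    literal_elements: int


def quality_score(c: Candidate, w: float) -> float:
    # score = (number of positive matches) - w * (number of neutral matches)
    return c.positive_matches - w * c.neutral_matches


def weight_at(iteration: int, start: float = 0.5, step: float = 0.05) -> float:
    # Assumed linear schedule; the source gives only the starting value 0.5.
    return start + step * iteration


def select_best(candidates: list[Candidate], w: float) -> Candidate:
    # Highest score wins; ties go to the template with more gazetteer
    # elements, then to the one with fewer literal elements.
    return max(
        candidates,
        key=lambda c: (quality_score(c, w), c.gazetteer_elements, -c.literal_elements),
    )


if __name__ == "__main__":
    pool = [
        Candidate("A", 10, 2, 1, 2),
        Candidate("B", 10, 2, 2, 1),
        Candidate("C", 12, 8, 3, 0),
    ]
    print(select_best(pool, weight_at(0)).pattern)
```

With w = 0.5, candidates A and B both score 9 and C scores 8, so the tie-break selects B because it has more gazetteer elements and fewer literals.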
Table 1 summarises the search as it progresses for the first 50 iterations.
On looking at Table 1, we see that the template evaluated on the 30th
iteration matches 286 fragments from the positive corpus, and four from the
neutral corpus. The template pattern is: [ Γ : protein] [ : ?????] [ Γ : binding]
[ : ???] [ Γ : protein, sp]. This matches a protein followed by between 0 and 5
words, followed by a protein-binding term, followed by between 0 and 3 other
words, followed by another protein (of sub-type “sp”; this refers to terms in
a gazetteer derived from the SwissProt database). The five sentences listed
below were among those found in the positive corpus. The italicised portions
show the fragments matched by the template; the rest is given for context
only:
Protein kinase C delta associates with and phosphorylates Stat3 in an
interleukin-6-dependent manner.
Furthermore, Stat3 was phosphorylated by PKC delta in vivo on Ser-727 ...
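To make the matching behaviour of such a template concrete, the following Python sketch applies a pattern of the same shape (protein, up to 5 words, binding term, up to 3 words, protein) to tokenised sentences. It is a toy under stated assumptions: the gazetteer contents, the tokeniser, and all function names are hypothetical, the "sp" sub-type is ignored, and BioRAT's actual matching machinery is not described in this section.

```python
import re

# Toy gazetteers (hypothetical entries); each entry is a tuple of lower-case tokens.
GAZETTEERS = {
    "protein": [("protein", "kinase", "c", "delta"), ("pkc", "delta"), ("stat3",)],
    "binding": [("associates", "with"), ("binds", "to"), ("phosphorylated",)],
}

# Template: gazetteer elements and bounded wildcards (maximum number of words).
TEMPLATE = [
    ("gaz", "protein"),
    ("wild", 5),
    ("gaz", "binding"),
    ("wild", 3),
    ("gaz", "protein"),
]


def tokenize(sentence):
    # Simplistic tokeniser: lower-case words, digits and hyphens.
    return re.findall(r"[a-z0-9-]+", sentence.lower())


def match_at(tokens, pos, template):
    """Return the end index of a match starting at tokens[pos], or None."""
    if not template:
        return pos
    (kind, arg), rest = template[0], template[1:]
    if kind == "gaz":
        for entry in GAZETTEERS[arg]:
            end = pos + len(entry)
            if tuple(tokens[pos:end]) == entry:
                result = match_at(tokens, end, rest)
                if result is not None:
                    return result
        return None
    # Wildcard: skip between 0 and arg tokens, trying the shortest skip first.
    for skip in range(arg + 1):
        if pos + skip > len(tokens):
            break
        result = match_at(tokens, pos + skip, rest)
        if result is not None:
            return result
    return None


def find_match(sentence, template=TEMPLATE):
    """Return the first matched fragment as a string, or None if nothing matches."""
    tokens = tokenize(sentence)
    for start in range(len(tokens)):
        end = match_at(tokens, start, template)
        if end is not None:
            return " ".join(tokens[start:end])
    return None


if __name__ == "__main__":
    print(find_match("Protein kinase C delta associates with and "
                     "phosphorylates Stat3 in an interleukin-6-dependent manner."))
```

Counting the sentences in each corpus for which find_match returns a fragment would give the positive and neutral match counts used by the quality score.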