A Logical Framework for Template Creation and Information Extraction - Data Mining: Foundations and Practice


6.5 Sample Results

The algorithm outlined in Fig. 3 has been implemented as an extension to the

BioRAT tool [5]. As a proof of principle of the methods described here, we

used the implementation to derive a template which could be used to identify

protein-protein interactions from biological literature.

We start with the seed phrase “Rad53p protein binds to Dbf4p”, where

“Rad53p” and “Dbf4p” are proteins, and “binds to” suggests a direct physical

interaction. We create a positive corpus by extracting 500 abstracts that are

listed in the Database of Interacting Proteins (DIP [26]). We assume that most

of these abstracts mention at least one protein-protein interaction, although

the information will be expressed in many different ways. We also create a

neutral corpus by selecting the first 500 abstracts dated September 2005

retrieved from PubMed. This is essentially a random set of biomedical texts. It

may contain some references to particular protein-protein interactions, but

we assume that most of these abstracts will not.

To select the “best template” during the search (Fig. 3, Steps 4a and 5),

we define a quality score thus:

score = (number of positive matches) - w × (number of neutral matches)

The weight w is used to control the trade-off between true- and false-positives

and hence balance the search for high recall and high precision templates. (As

explained in Sect. 6.1, we are assuming that each match in the neutral corpus

is a false-positive result.) We start with a small value of w = 0.5 and slowly

increase this as the search progresses. This encourages early exploration of the

search space while later penalising false positive matches heavily. We select

the unevaluated template with the highest score at each step. In cases where

two or more templates have the same score, we prefer the template that

contains more gazetteer elements and fewer literal elements than the

others. This deliberately introduces a slight bias favouring templates that

contain references to gazetteers, as these represent useful domain knowledge.
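
The scoring and tie-breaking rule described above can be sketched in a few lines of Python. This is only an illustration, not the BioRAT implementation: the Candidate class, its field names and the select_best function are hypothetical, and the order in which the two tie-break criteria are applied (gazetteer count first, then literal count) is an assumption. The text does not say how quickly w is increased, so w is simply passed in by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one unevaluated template."""
    pattern: str
    positive_matches: int    # matches in the positive corpus
    neutral_matches: int     # matches in the neutral corpus (treated as false positives)
    gazetteer_elements: int  # number of gazetteer elements in the template
    literal_elements: int    # number of literal elements in the template


def score(candidate, w):
    """score = (number of positive matches) - w * (number of neutral matches)."""
    return candidate.positive_matches - w * candidate.neutral_matches


def select_best(candidates, w):
    """Pick the highest-scoring candidate.

    Ties are broken in favour of more gazetteer elements, then fewer
    literal elements (the ordering of these two criteria is assumed).
    """
    return max(
        candidates,
        key=lambda c: (score(c, w), c.gazetteer_elements, -c.literal_elements),
    )


# Toy counts, invented for illustration only.
a = Candidate("A", positive_matches=10, neutral_matches=4,
              gazetteer_elements=1, literal_elements=3)
b = Candidate("B", positive_matches=8, neutral_matches=0,
              gazetteer_elements=2, literal_elements=1)
c = Candidate("C", positive_matches=12, neutral_matches=2,
              gazetteer_elements=0, literal_elements=4)
```

With w = 0.5, candidates A and B both score 8, so the tie-break prefers B, while C scores 11 and wins overall. With a larger weight such as w = 3, C drops to 6 and B is selected, which shows how raising w shifts the search towards templates with few neutral matches.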

Table 1 summarises the search as it progresses for the first 50 iterations.

On looking at Table 1, we see that the template evaluated on the 30th

iteration matches 286 fragments from the positive corpus, and four from the

neutral corpus. The template pattern is: [Γ: protein] [Ω: ?????] [Γ: binding]

[Ω: ???] [Γ: protein, sp]. This matches a protein followed by between 0 and 5

words, followed by a protein-binding term, followed by between 0 and 3 other

words, followed by another protein (of sub-type “sp”; this refers to terms in

a gazetteer derived from the SwissProt database). The five sentences listed

below were among those found in the positive corpus. The italicised portions

show the fragments matched by the template; the rest is given for context

only:


• Protein kinase C delta associates with and phosphorylates Stat3 in an interleukin-6-dependent manner.

• Furthermore, Stat3 was phosphorylated by PKC delta in vivo on Ser-727 ...
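
To make the template pattern described above concrete, the following Python sketch approximates it with a regular expression. It is not how BioRAT evaluates templates. The gazetteers here are tiny hypothetical lists chosen for illustration, the "sp" sub-type is ignored, and a "word" is assumed to be any run of non-whitespace characters.

```python
import re

# Hypothetical toy gazetteers; the real ones (for example the
# SwissProt-derived "sp" list) are far larger and are not reproduced here.
PROTEIN_TERMS = ["Protein kinase C delta", "Rad53p", "Dbf4p", "Stat3"]
BINDING_TERMS = ["binds", "associates"]


def gazetteer(terms):
    """Regex alternation over gazetteer entries, longest entry first."""
    ordered = sorted(terms, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(t) for t in ordered) + ")"


def wildcard(max_words):
    """Between 0 and max_words arbitrary whitespace-separated words."""
    return r"(?:\s+\S+){0,%d}?" % max_words


TEMPLATE = re.compile(
    r"\b" + gazetteer(PROTEIN_TERMS)      # [Γ: protein]
    + wildcard(5)                          # [Ω: ?????]
    + r"\s+" + gazetteer(BINDING_TERMS)    # [Γ: binding]
    + wildcard(3)                          # [Ω: ???]
    + r"\s+" + gazetteer(PROTEIN_TERMS)    # [Γ: protein, sp], sub-type ignored
    + r"\b"
)


def find_fragment(sentence):
    """Return the first fragment matched by the template, or None."""
    match = TEMPLATE.search(sentence)
    return match.group(0) if match else None


def count_matches(corpus):
    """Count matched fragments over an iterable of texts."""
    return sum(len(TEMPLATE.findall(text)) for text in corpus)
```

Under these toy gazetteers, find_fragment matches the whole seed phrase "Rad53p protein binds to Dbf4p", and rejects a sentence in which more than five words separate the first protein from the binding term. Applying count_matches to the positive and neutral corpora would give the two counts that feed the quality score.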
