In [15], the authors show that the shortest path-enclosed tree (SPT) is effective in identifying the relation between two entities mentioned in a segment of text. Therefore, given an instance, we first construct the smallest common sub-tree that includes the two proteins. In other words, the sub-tree is enclosed by the shortest path linking the two proteins p_i and p_j in the parse tree, as shown in Fig. 4(b). Next, to make the IPT concise and clear, we remove non-discriminative IPT elements. Frequent words are not useful for expressing interactions between proteins; for instance, the word "both" in Fig. 4(c) is a common word and cannot discriminate interactive expressions. To remove stop words and their corresponding syntactic elements from the IPT, we sort words by their frequency in the text corpus and compile the most frequent words into a stop word list. To refine the list, protein names and verbs are excluded from it, because they are key constructs of protein-protein interactions. Finally, the generated interaction patterns help us capture the most prominent and representative expressions of PPIs. Highlighting interaction patterns closely associated with PPIs in an IPT improves interaction extraction performance: for each IPT that matches an interaction pattern, we add an IP tag as a child of the tree root to incorporate the interactive semantics into the IPT structure (as shown in Fig. 4(d)).
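As an illustration of the stop-word compilation step, the following is a minimal Python sketch. It assumes a tokenized, POS-tagged corpus with Penn Treebank-style tags (verb tags start with "VB") and a list of known protein names; the function name build_stop_word_list and the cutoff top_k are our own, not from the paper.

from collections import Counter

def build_stop_word_list(tokens, pos_tags, protein_names, top_k=100):
    # Rank words by corpus frequency; the most frequent form the candidate list.
    freq = Counter(w.lower() for w in tokens)
    candidates = [w for w, _ in freq.most_common(top_k)]
    # Verbs (Penn Treebank tags starting with "VB") and protein names are
    # excluded, since they are key constructs of protein-protein interactions.
    verbs = {w.lower() for w, tag in zip(tokens, pos_tags) if tag.startswith("VB")}
    proteins = {p.lower() for p in protein_names}
    return {w for w in candidates if w not in verbs and w not in proteins}

Pruning the IPT then amounts to deleting leaf nodes whose word appears in this list, together with their now-empty syntactic ancestors.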
A convolution kernel aims to capture structured information in terms of
substructures. Generally, we can represent a parse tree T by a vector of integer counts
of each sub-tree type (regardless of its ancestors):
\psi(T) = (\#subtree_1(T), \ldots, \#subtree_i(T), \ldots, \#subtree_n(T))    (4)
where #subtree_i(T) is the number of occurrences of the i-th sub-tree type (subtree_i) in T. Since the number of distinct sub-trees grows exponentially with the size of the parse tree, it is computationally infeasible to use the feature vector ψ(T) directly. To address this issue, we leverage the convolution tree kernel [3], which computes the syntactic similarity between these high-dimensional vectors implicitly.
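To make Eq. (4) concrete, the sketch below enumerates the sub-tree types of a toy parse tree encoded as nested tuples, where a pre-terminal holds a single word string. The encoding and function names are our own; the enumeration follows the fragment convention of [3], where expanding a node means keeping that node's entire production. Running it on even modestly deep trees exhibits the exponential blow-up that motivates computing the kernel implicitly.

from collections import Counter
from itertools import product

def is_preterminal(node):
    # A pre-terminal is a POS tag dominating one word, e.g. ("NN", "protein").
    return len(node) == 2 and isinstance(node[1], str)

def fragments(node):
    # All sub-tree fragments rooted at `node`: whenever a node is expanded,
    # its entire production (the node with all its direct children) is kept.
    if is_preterminal(node):
        return [node]
    label, *children = node
    # For each child, either stop at its bare label or expand it further.
    options = [[child[0]] + fragments(child) for child in children]
    return [(label, *combo) for combo in product(*options)]

def subtree_vector(tree):
    # psi(T): occurrence counts of every sub-tree type, over all nodes of T.
    counts = Counter()
    stack = [tree]
    while stack:
        node = stack.pop()
        counts.update(fragments(node))
        if not is_preterminal(node):
            stack.extend(node[1:])
    return counts

toy = ("NP", ("DT", "the"), ("NN", "protein"))
print(sum(subtree_vector(toy).values()))  # 6 sub-tree instances in this tiny tree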
Specifically, the convolution tree kernel K_CTK counts the number of common sub-trees as the syntactic similarity between two rich interactive trees IPT_1 and IPT_2 as follows:
K_{CTK}(IPT_1, IPT_2) = \sum_{n_1 \in N_1,\, n_2 \in N_2} \Delta(n_1, n_2)    (5)
where N_1 and N_2 are the sets of nodes in IPT_1 and IPT_2, respectively. In addition, Δ(n_1, n_2) evaluates the number of common sub-trees rooted at n_1 and n_2 and is computed recursively as follows:
(1) if the productions (i.e., the nodes together with their direct children) at n_1 and n_2 are different, Δ(n_1, n_2) = 0;
(2) else if both n_1 and n_2 are pre-terminals (POS tags), Δ(n_1, n_2) = 1 × λ;
(3) else calculate Δ(n_1, n_2) recursively as:
\Delta(n_1, n_2) = \lambda \prod_{k=1}^{\#ch(n_1)} \bigl(1 + \Delta(ch(n_1, k),\, ch(n_2, k))\bigr)    (6)
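Here ch(n_1, k) denotes the k-th child of n_1, #ch(n_1) is its number of children, and λ is the decay factor from rule (2), with 0 < λ ≤ 1, which damps the contribution of larger sub-trees. Below is a minimal Python sketch of Eqs. (5) and (6), reusing the nested-tuple tree encoding from the earlier sketch; the memoization and the sample value λ = 0.4 are our own choices, not prescribed by the paper.

def production(node):
    # A node's production: its label plus the labels of its direct children.
    return (node[0], tuple(c if isinstance(c, str) else c[0] for c in node[1:]))

def is_preterminal(node):
    return len(node) == 2 and isinstance(node[1], str)

def delta(n1, n2, lam, memo):
    key = (n1, n2)
    if key in memo:
        return memo[key]
    if production(n1) != production(n2):   # rule (1): different productions
        result = 0.0
    elif is_preterminal(n1):               # rule (2): matching pre-terminals
        result = lam
    else:                                  # rule (3): Eq. (6), recurse on children
        result = lam
        for c1, c2 in zip(n1[1:], n2[1:]):
            result *= 1.0 + delta(c1, c2, lam, memo)
    memo[key] = result
    return result

def collect_nodes(tree):
    nodes = [tree]
    for child in tree[1:]:
        if not isinstance(child, str):
            nodes.extend(collect_nodes(child))
    return nodes

def k_ctk(ipt1, ipt2, lam=0.4):
    # Eq. (5): sum Delta over all node pairs of the two IPTs.
    memo = {}
    return sum(delta(a, b, lam, memo)
               for a in collect_nodes(ipt1) for b in collect_nodes(ipt2))

Memoization makes the computation O(|N_1| · |N_2|) rather than exponential, and the child-wise zip in rule (3) is safe because rule (1) only lets the recursion proceed when the two productions match, which guarantees equal numbers of children.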