In [15], the authors show that the shortest path-enclosed tree (SPT) is effective in identifying the relation between two entities mentioned in a segment of text. Therefore, given an instance, we first construct the smallest common sub-tree that includes the two proteins. In other words, the sub-tree is enclosed by the shortest path linking the two proteins p_i and p_j in the parse tree, as shown in Fig. 4(b). Next, to make the IPT concise and clear, we remove non-discriminative IPT elements. Frequent words are not useful for expressing interactions between proteins; for instance, the word "both" in Fig. 4(c) is a common word and cannot discriminate interactive expressions. To remove stop words and their corresponding syntactic elements from the IPT, we sort words by their frequency in the text corpus and compile the most frequent words into a stop word list. To refine the list, protein names and verbs are excluded from it, because they are key constructs of protein-protein interactions. Finally, the generated interaction patterns help us capture the most prominent and representative expressions of PPIs. Highlighting interaction patterns closely associated with PPIs in an IPT improves interaction extraction performance: for each IPT that matches an interaction pattern, we add an IP tag as a child of the tree root to incorporate the interactive semantics into the IPT structure (as shown in Fig. 4(d)).
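As an illustration of the stop-word compilation step, the following is a minimal Python sketch. It assumes a tokenized, POS-tagged corpus with Penn Treebank-style tags (verb tags start with "VB") and a list of known protein names; the function name build_stop_word_list and the cutoff top_k are our own, not from the paper.

from collections import Counter

def build_stop_word_list(tokens, pos_tags, protein_names, top_k=100):
    # Rank words by corpus frequency; the most frequent form the candidate list.
    freq = Counter(w.lower() for w in tokens)
    candidates = [w for w, _ in freq.most_common(top_k)]
    # Verbs (Penn Treebank tags starting with "VB") and protein names are
    # excluded, since they are key constructs of protein-protein interactions.
    verbs = {w.lower() for w, tag in zip(tokens, pos_tags) if tag.startswith("VB")}
    proteins = {p.lower() for p in protein_names}
    return {w for w in candidates if w not in verbs and w not in proteins}

Pruning the IPT then amounts to deleting leaf nodes whose word appears in this list, together with their now-empty syntactic ancestors.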
A convolution kernel aims to capture structured information in terms of
substructures. Generally, we can represent a parse tree T by a vector of integer counts
of each sub-tree type (regardless of its ancestors):
\psi(T) = (\#subtree_1(T), \ldots, \#subtree_i(T), \ldots, \#subtree_n(T))    (4)
where #subtree_i(T) is the number of occurrences of the i-th sub-tree type (subtree_i) in T. Since the number of distinct sub-trees grows exponentially with the size of the parse tree, it is computationally infeasible to use the feature vector ψ(T) directly. To address this issue, we leverage the convolution tree kernel [3], which computes the syntactic similarity between these high-dimensional vectors implicitly.
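To make Eq. (4) concrete, the sketch below enumerates the sub-tree types of a toy parse tree encoded as nested tuples, where a pre-terminal holds a single word string. The encoding and function names are our own; the enumeration follows the fragment convention of [3], where expanding a node means keeping that node's entire production. Running it on even modestly deep trees exhibits the exponential blow-up that motivates computing the kernel implicitly.

from collections import Counter
from itertools import product

def is_preterminal(node):
    # A pre-terminal is a POS tag dominating one word, e.g. ("NN", "protein").
    return len(node) == 2 and isinstance(node[1], str)

def fragments(node):
    # All sub-tree fragments rooted at `node`: whenever a node is expanded,
    # its entire production (the node with all its direct children) is kept.
    if is_preterminal(node):
        return [node]
    label, *children = node
    # For each child, either stop at its bare label or expand it further.
    options = [[child[0]] + fragments(child) for child in children]
    return [(label, *combo) for combo in product(*options)]

def subtree_vector(tree):
    # psi(T): occurrence counts of every sub-tree type, over all nodes of T.
    counts = Counter()
    stack = [tree]
    while stack:
        node = stack.pop()
        counts.update(fragments(node))
        if not is_preterminal(node):
            stack.extend(node[1:])
    return counts

toy = ("NP", ("DT", "the"), ("NN", "protein"))
print(sum(subtree_vector(toy).values()))  # 6 sub-tree instances in this tiny tree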
Specifically, the convolution tree kernel K_CTK counts the number of common sub-trees as the syntactic similarity between two rich interactive trees IPT_1 and IPT_2 as follows:
K_{CTK}(IPT_1, IPT_2) = \sum_{n_1 \in N_1,\, n_2 \in N_2} \Delta(n_1, n_2)    (5)
where N_1 and N_2 are the sets of nodes in IPT_1 and IPT_2, respectively. In addition, Δ(n_1, n_2) evaluates the number of common sub-trees rooted at n_1 and n_2 and is computed recursively as follows:
(1) if the productions (i.e., the nodes together with their direct children) at n_1 and n_2 are different, Δ(n_1, n_2) = 0;
(2) else if both n_1 and n_2 are pre-terminals (POS tags), Δ(n_1, n_2) = 1 × λ;
(3) else calculate Δ(n_1, n_2) recursively as:
\Delta(n_1, n_2) = \lambda \prod_{k=1}^{\#ch(n_1)} \bigl(1 + \Delta(ch(n_1, k),\, ch(n_2, k))\bigr)    (6)
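Here ch(n_1, k) denotes the k-th child of n_1, #ch(n_1) is its number of children, and λ is the decay factor from rule (2), with 0 < λ ≤ 1, which damps the contribution of larger sub-trees. Below is a minimal Python sketch of Eqs. (5) and (6), reusing the nested-tuple tree encoding from the earlier sketch; the memoization and the sample value λ = 0.4 are our own choices, not prescribed by the paper.

def production(node):
    # A node's production: its label plus the labels of its direct children.
    return (node[0], tuple(c if isinstance(c, str) else c[0] for c in node[1:]))

def is_preterminal(node):
    return len(node) == 2 and isinstance(node[1], str)

def delta(n1, n2, lam, memo):
    key = (n1, n2)
    if key in memo:
        return memo[key]
    if production(n1) != production(n2):   # rule (1): different productions
        result = 0.0
    elif is_preterminal(n1):               # rule (2): matching pre-terminals
        result = lam
    else:                                  # rule (3): Eq. (6), recurse on children
        result = lam
        for c1, c2 in zip(n1[1:], n2[1:]):
            result *= 1.0 + delta(c1, c2, lam, memo)
    memo[key] = result
    return result

def collect_nodes(tree):
    nodes = [tree]
    for child in tree[1:]:
        if not isinstance(child, str):
            nodes.extend(collect_nodes(child))
    return nodes

def k_ctk(ipt1, ipt2, lam=0.4):
    # Eq. (5): sum Delta over all node pairs of the two IPTs.
    memo = {}
    return sum(delta(a, b, lam, memo)
               for a in collect_nodes(ipt1) for b in collect_nodes(ipt2))

Memoization makes the computation O(|N_1| · |N_2|) rather than exponential, and the child-wise zip in rule (3) is safe because rule (1) only lets the recursion proceed when the two productions match, which guarantees equal numbers of children.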