Table 3. Stems of filtering and relation keywords

Class      Keyword stems
Filter     bound, restricted, specific, tetramer, transgenic
Relation   associat, correlat, confer, decreas, development, differen, effect,
           role, found, overexpress, overrepresent, frequent, greater, higher,
           increas, linkage, lower, marker, predispos, progress, resist,
           protect, risk, suscept, effector
Domain experts identified words that indicate an HLA-disease relation in the
literature; we call these relation keywords. However, even among sentences that
contain HLA and disease entities together with relation keywords, sentences
describing immune responses were found to carry little HLA-disease relation. In
such sentences, HLA serves only as supplementary information about genes and has
no direct connection with the disease. We therefore extracted words that indicate
an immune response test and call these filtering keywords. We use stems of the
keywords rather than the keywords themselves, since relation and filtering
keywords may appear in many different forms, as in 'Correlate', 'Correlation',
and 'Correlated'. How the filtering keywords in Table 3 are used is discussed in
the information extraction section.
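
As a minimal sketch of how stems can stand in for full keywords, the snippet
below classifies a single word against the stems of Table 3. The function name
classify_word, the class labels 'filter' and 'relation', and the use of
case-insensitive prefix matching are assumptions made for illustration; the
text does not state how stems are matched against words.

```python
# Stems copied from Table 3.
FILTER_STEMS = ["bound", "restricted", "specific", "tetramer", "transgenic"]
RELATION_STEMS = [
    "associat", "correlat", "confer", "decreas", "development", "differen",
    "effect", "role", "found", "overexpress", "overrepresent", "frequent",
    "greater", "higher", "increas", "linkage", "lower", "marker",
    "predispos", "progress", "resist", "protect", "risk", "suscept",
    "effector",
]


def classify_word(word):
    """Return 'filter', 'relation', or None for a single word.

    Assumption: a word matches a stem when its lowercased form starts
    with that stem. The source does not specify the matching rule.
    """
    w = word.lower()
    if any(w.startswith(stem) for stem in FILTER_STEMS):
        return "filter"
    if any(w.startswith(stem) for stem in RELATION_STEMS):
        return "relation"
    return None


# The three surface forms from the text map to the same class.
for form in ("Correlate", "Correlation", "Correlated"):
    print(form, classify_word(form))  # each prints "relation"
```

Plain prefix matching is deliberately crude: it would also accept unrelated
words that happen to share a prefix (for example 'foundation' for the stem
'found'), so a real system would likely need a stemmer or a stricter rule.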
3.2 Parse Tree Search Algorithm
We started extracting information by generating a parse tree for each sentence
in PubMed using the Collins parser. We selected the Collins parser because it
shows the highest precision among current parsers. In a parse tree, each terminal
node represents a word in the sentence and each nonterminal node represents
grammatical and dependency information between terminal nodes.
After building the parse trees, we constructed our own information extraction
algorithm based on postorder traversal. To extract only the HLA-disease relation
in a sentence, we first recognized HLA entities (H), disease entities (D),
relation keywords (A), and filtering keywords (F) in the terminal nodes. Then we
searched the parse trees using postorder traversal, in which terminal nodes are
visited first. When the search algorithm visits a terminal node recognized as
H, D, A, or F, it copies the content (word) of that node to all of its parent
nodes. If a nonterminal node has H, D, and A all in its subtree, the H, D, and A
words are thus collected together in that nonterminal node. When the search
algorithm then visits the nonterminal node, it extracts the essential relation
information, namely the HaD information. The following enumerated steps depict
a sample run on the tree of Fig. 3.
1. The search algorithm visits terminal nodes first.
2. Celiac disease is recognized as a disease entity and is copied to all of its
parent nodes (celiac disease -> NPB -> S1 -> SBAR -> VP -> VP -> VP -> S2).
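
The propagation of H, D, A, and F words up the tree can be sketched with a
recursive postorder traversal. This is an illustrative sketch, not the authors'
implementation: the Node class, the collect function, and the toy sentence are
hypothetical, and two behaviors are assumptions because the text leaves them
open. First, only the lowest nonterminal that holds H, D, and A together is
reported. Second, F words are collected but not acted on, since their use is
described in a later section.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    label: str                                   # tag, or the word for a terminal
    children: List["Node"] = field(default_factory=list)
    kind: Optional[str] = None                   # 'H', 'D', 'A', 'F' on terminals


def collect(node: Node, hits: List[Tuple[str, Dict[str, List[str]]]]):
    """Postorder walk: children are processed before their parent.

    Returns the H/D/A/F words found in the subtree of `node` and appends
    (label, words) to `hits` for the lowest nonterminal containing H, D, A.
    """
    if not node.children:                        # terminal node
        return {node.kind: [node.label]} if node.kind else {}

    merged: Dict[str, List[str]] = {}
    child_complete = False
    for child in node.children:
        found = collect(child, hits)
        if all(k in found for k in ("H", "D", "A")):
            child_complete = True                # already reported lower down
        for kind, words in found.items():
            merged.setdefault(kind, []).extend(words)

    if not child_complete and all(k in merged for k in ("H", "D", "A")):
        hits.append((node.label, merged))
    return merged


# Hypothetical toy tree for "HLA-DQ2 is associated with celiac disease".
tree = Node("S", [
    Node("NP", [Node("HLA-DQ2", kind="H")]),
    Node("VP", [
        Node("is"),
        Node("VP", [
            Node("associated", kind="A"),
            Node("PP", [
                Node("with"),
                Node("NP", [Node("celiac disease", kind="D")]),
            ]),
        ]),
    ]),
])

hits: List[Tuple[str, Dict[str, List[str]]]] = []
collect(tree, hits)
print(hits)
```

In this toy tree the inner VP holds only A and D, so the root S is the first
node at which H, D, and A meet, and it is the only node reported.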