Automatic Extraction of HLA-Disease Interaction Information from Biomedical Literature - Advances in Computational Science and Engineering


Table 3. Stems of filtering and relation keywords

Class      Keyword stems
Filter     bound, restricted, specific, tetramer, transgenic
Relation   associat, correlat, confer, decreas, development, differen, effect, role,
           found, overexpress, overrepresent, frequent, greater, higher, increas,
           linkage, lower, marker, predispos, progress, resist, protect, risk, suscept,
           effector

Domain experts identified words in the literature that indicate an HLA-disease relation, and we called these words relation keywords. However, even among sentences containing HLA and disease entities together with relation keywords, sentences describing immune responses were shown to carry little HLA-disease relation. In these sentences, HLA serves only as supplementary information about genes and has no direct connection with the disease. We therefore extracted words indicating immune response tests and called them filtering keywords. We use the stems of keywords rather than the keywords themselves, since relation and filtering keywords may appear in many different forms, as in 'Correlate', 'Correlation', and 'Correlated'. How the filtering keywords in Table 3 are used is discussed in the information extraction section.
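As an illustration of stem-based keyword recognition, the following is a minimal sketch rather than the authors' implementation. The stems are copied from Table 3; the function names `classify_token` and `tag_keywords`, the label strings, and the choice of case-insensitive prefix matching are assumptions made for this example, since the text does not specify how stems are compared with words.

```python
import re

# Stems copied from Table 3.
FILTER_STEMS = ["bound", "restricted", "specific", "tetramer", "transgenic"]
RELATION_STEMS = [
    "associat", "correlat", "confer", "decreas", "development", "differen",
    "effect", "role", "found", "overexpress", "overrepresent", "frequent",
    "greater", "higher", "increas", "linkage", "lower", "marker", "predispos",
    "progress", "resist", "protect", "risk", "suscept", "effector",
]


def classify_token(token):
    """Return 'filter', 'relation', or None for a single word.

    Assumption: a word matches a stem when its lowercase form starts with
    that stem. Filter stems are checked first (also an assumption).
    """
    word = token.lower()
    if any(word.startswith(stem) for stem in FILTER_STEMS):
        return "filter"
    if any(word.startswith(stem) for stem in RELATION_STEMS):
        return "relation"
    return None


def tag_keywords(sentence):
    """Return (word, label) pairs for every keyword found in the sentence."""
    words = re.findall(r"[A-Za-z]+", sentence)
    tagged = []
    for word in words:
        label = classify_token(word)
        if label is not None:
            tagged.append((word, label))
    return tagged
```

With this sketch, 'Correlate', 'Correlation', and 'Correlated' all map to the stem `correlat`. Plain prefix matching can over-match (for example, `found` also matches "foundation"), so a real system would likely need a stricter comparison.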

3.2 Parse Tree Search Algorithm

We began information extraction by generating a parse tree for each sentence in PubMed using the Collins parser. We selected the Collins parser because it shows the highest precision among currently available parsers. In a parse tree, each terminal node represents a word in the sentence, and each nonterminal node represents grammatical and dependency information between terminal nodes.

After building the parse trees, we constructed our own information extraction algorithm based on postorder traversal. To extract only HLA-disease relations from a sentence, we first recognized HLA entities (H), disease entities (D), relation keywords (A), and filtering keywords (F) in the terminal nodes. We then searched the parse trees using postorder traversal, in which terminal nodes are visited first. When the search algorithm visits a terminal node recognized as H, D, A, or F, it copies the node's content (the word) to all of its parent nodes. Consequently, if a nonterminal node has H, D, and A in its subtree, the H, D, and A words are collected together in that node. When the search algorithm then visits the nonterminal node, it extracts the essential relation information, namely the HaD information. The following enumerated steps depict a sample run on the tree in Fig. 3.

1. The search algorithm visits the terminal nodes first.

2. Celiac disease is recognized as a disease entity and is copied to all of its parent nodes. (celiac disease → NPB → S1 → SBAR → VP → S2)
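The propagation of H, D, and A words up the tree can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Node` class, the function `extract_relations`, and the toy tree are hypothetical (the toy tree is not the tree of Fig. 3), and extraction is performed at the lowest nonterminal node that contains H, D, and A, which is one possible reading of the description. Filtering keywords (F) are collected but not acted on here, since their use is described in a different section.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    label: str                                  # nonterminal label, or the word for a terminal
    children: List["Node"] = field(default_factory=list)
    tag: Optional[str] = None                   # 'H', 'D', 'A', or 'F' on recognized terminals
    found: Dict[str, List[str]] = field(default_factory=dict)


def has_all(found):
    """True when H, D, and A words have all been collected."""
    return all(tag in found for tag in ("H", "D", "A"))


def extract_relations(root):
    """Postorder search returning (node label, H words, A words, D words) tuples."""
    results: List[Tuple[str, List[str], List[str], List[str]]] = []

    def visit(node):
        node.found = {}
        if not node.children:                   # terminal node: record its own word
            if node.tag:
                node.found = {node.tag: [node.label]}
            return
        for child in node.children:             # children are visited before the parent
            visit(child)
            for tag, words in child.found.items():
                node.found.setdefault(tag, []).extend(words)
        # Assumption: extract only at the lowest node that holds H, D, and A.
        child_complete = any(has_all(c.found) for c in node.children)
        if has_all(node.found) and not child_complete:
            results.append(
                (node.label, node.found["H"], node.found["A"], node.found["D"])
            )

    visit(root)
    return results


# Hypothetical toy parse tree for "HLA-DQ2 is associated with celiac disease".
tree = Node("S", [
    Node("NPB", [Node("HLA-DQ2", tag="H")]),
    Node("VP", [
        Node("is"),
        Node("VP", [
            Node("associated", tag="A"),
            Node("PP", [
                Node("with"),
                Node("NPB", [Node("celiac disease", tag="D")]),
            ]),
        ]),
    ]),
])

print(extract_relations(tree))
```

In this toy tree, only the root `S` has H, D, and A in its subtree, so the HaD triple is collected there. Merging each child's collected words into its parent has the same effect as copying every recognized word to all of its parent nodes.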
