Automatic Extraction of HLA-Disease Interaction Information from Biomedical Literature - Advances in Computational Science and Engineering

HLA (human leukocyte antigen) plays an important role in human immunity, and
each person has a specific set of allelic pairs. Knowledge of the alleles of the
6 main HLA genes (HLA-A, -B, -C, -DRB1, -DQB1, -DPB1) is continually developing:
it was reported in 2004 that 1,729 alleles had been found, and this number has
been increasing exponentially every year. A person's allelic makeup can
influence their response to disease. Even when people are infected with the same
microorganism, their responses may vary from self-healing to serious disease.
Because HLA allele frequency differs according to geographic location, a
considerable number of studies have been carried out on the relationship between
HLA allele frequency and disease, but still little is known. The relation
between HLA and disease can be found through text mining techniques such as
Named Entity Recognition (NER) and Information Extraction (IE).

There have been various attempts to efficiently find entities within biomedical
literature. Hanisch[1] found protein names that appear in biomedical text using
search terms of protein names. Hatzivassiloglou[2] and Kazama[3] used machine
learning approaches with features such as word formation patterns, POS
information, semantic information, prefixes, and suffixes. The performance of
these methods is about 60-80%.

There have also been numerous attempts to find interactions between entities
mentioned in the literature. Friedman[4] and Temkin[5] extracted protein-protein
interactions from biomedical abstracts using keywords and grammars built by
domain experts. Leroy[6] used Finite State Automata (FSA) with closed-class
words, and demonstrated that FSA can extract information from literature.
McDonald[7] generated potential parse trees using their parser and filtered out
parse trees with little information. A filtering algorithm is used to select,
among the potential parse trees, the informative ones that carry valid
interaction information. This method has the advantage that no grammar is
necessary to extract information. Horn[8] extracted interaction information
between proteins and point mutations rather than between proteins.
Novichkova[9] introduced a general biomedical domain-oriented system that can
extract various kinds of biomedical information.

In this study, we aim to extract interaction information between HLA and
disease using text mining methods. To deal with HLA name variants, we built a
regular expression for HLA names and used the MeSH ontology. We make use of the
structural information of sentences with the aim of finding interactions
between HLA and disease. The structural information of a sentence is derived by
applying a parse tree to the dependency relationships among the keywords in the
sentence. The system of McDonald[7] uses potential parse trees produced by its
own parser, whereas our system builds the parse tree from the dependency
relationships between the keywords. This method analyzes complex sentences more
effectively and extracts relation information more accurately between entities
joined by coordinating conjunctions such as 'and' and 'or'.
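To make the idea of a regular expression for HLA name variants concrete, the
following is a minimal sketch in Python. It is not the expression used in this
study, which is not given here: the pattern, the canonical output form, and the
names HLA_PATTERN and normalize_hla are assumptions chosen for illustration, and
only the 6 main genes listed earlier are covered.

```python
import re

# Hypothetical pattern (an assumption, not the authors' expression):
# "HLA", an optional hyphen or space, one of the 6 main loci, then an
# optional allele designation such as 27, *0401 or *04:01.
HLA_PATTERN = re.compile(
    r"\bHLA[-\s]?"
    r"(?P<locus>A|B|C|DRB1|DQB1|DPB1)"
    r"(?:\*?(?P<allele>\d{2,4}(?::\d{2,3})*))?"
    r"(?![A-Za-z0-9])"  # reject matches that run into a longer token
)


def normalize_hla(text):
    """Return the HLA mentions in text, rewritten as 'HLA-<locus>[*<allele>]'."""
    names = []
    for match in HLA_PATTERN.finditer(text):
        name = "HLA-" + match.group("locus")
        if match.group("allele"):
            name += "*" + match.group("allele")
        names.append(name)
    return names


print(normalize_hla("HLA B27 and HLA-DRB1*0401 were both examined."))
# prints ['HLA-B*27', 'HLA-DRB1*0401']
```

Mapping every surface form to one canonical string lets later stages treat
"HLA B27" and "HLA-B*27" as the same entity. Shorthand such as "HLA-A, -B and
-C" is not handled by this sketch and would need the sentence structure
described above.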

Our system is divided into 5 sub-processes: Tokenizing, POS Tagging, Entity
Recognizing, Syntactic Analysis and Semantic Analysis. While the HaDextract
system incorporates all 5 sub-processes, including hidden relations, other data
mining systems
