the same time are denoted as level 1. In the 909 abstracts, 305 sentences are
denoted as level 1. The level 1 sentences are divided again according to whether
they contain relation information between HLA and diseases. Sentences without
this relation information are denoted as level 2, and sentences with it are
denoted as level 3. Of the 305 level 1 sentences, 144 were confirmed as level 3
by a domain expert.
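The level scheme above can be sketched as a small decision function. This is
only an illustration under stated assumptions: the function name and the three
boolean flags are hypothetical, and the sketch assumes that entity recognition
and relation detection have already been run on the sentence.

```python
from typing import Optional


def classify_level(has_hla: bool, has_disease: bool, has_relation: bool) -> Optional[int]:
    """Return the final level of a sentence (hypothetical helper).

    A sentence containing both an HLA entity and a disease entity is a
    level 1 sentence. Level 1 sentences are then split into level 2
    (no HLA-disease relation) and level 3 (relation present).
    Sentences that are not level 1 get None.
    """
    if not (has_hla and has_disease):
        return None  # not a level 1 sentence, so it is not considered further
    return 3 if has_relation else 2


if __name__ == "__main__":
    print(classify_level(True, True, True))    # level 3
    print(classify_level(True, True, False))   # level 2
    print(classify_level(True, False, False))  # None
```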
We tested the algorithm on the 305 sentences, of which 144 are in level 3. Finding
129 sentences as level 3, our information extraction system reported a precision
rate of 89.6% (129/144). We analyzed the 15 misclassified sentences. In 4
sentences, words were incorrectly recognized as HLA entities. For example, 'A375'
was misidentified as an HLA entity by the system because its form resembles that
of HLA entities. Similarly, words were incorrectly recognized as disease entities
in 2 sentences. 7 sentences without relation information were not filtered out
during the process.
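As a quick check of the arithmetic, the reported rate is the number of
sentences the system found as level 3 divided by the number confirmed by the
domain expert. The variable names below are illustrative only. The sketch also
tallies the itemized error causes, which account for 13 of the 15 misclassified
sentences.

```python
# Figures taken from the text; variable names are illustrative.
expert_level3 = 144   # sentences confirmed as level 3 by the domain expert
found_level3 = 129    # sentences the system found as level 3

rate = found_level3 / expert_level3
misclassified = expert_level3 - found_level3

# Error causes itemized in the text.
itemized_errors = {
    "word misrecognized as HLA entity": 4,
    "word misrecognized as disease entity": 2,
    "sentence without relation not filtered out": 7,
}
itemized_total = sum(itemized_errors.values())

if __name__ == "__main__":
    print(f"rate = {rate:.1%}")
    print(f"misclassified = {misclassified}, itemized = {itemized_total}")
```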
The system showed an accuracy of 57.4% in summarization: 74 of 129 sentences
were correctly summarized. We applied a strict criterion to the evaluation
of the summarization. If the algorithm missed any entity or relation in a
sentence, we counted the summary as incorrect. We also analyzed the 55
inappropriately summarized sentences: we failed to find all HLA entities in 7
sentences, disease entities in 12 sentences, disease abbreviations in 4
sentences, and HLA haplotypes in 11 sentences. In 14 sentences we failed to
extract information because of an incorrect parse tree. Finding disease entities
and disease abbreviations shows the highest error rate in our analysis. These
failures occurred because the set of disease entities we searched for was
incomplete, owing to the limitations of MeSH.
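The strict criterion can be sketched as an all-or-nothing comparison between an
extracted summary and a reference summary. This is a minimal sketch, not the
evaluation code used in the study: the `Summary` type, its field names, and the
example values are hypothetical, and treating extra (spurious) items as errors
is an added assumption, since the text only states that missed items make a
summary incorrect.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Summary:
    """Hypothetical summary record: keyword sets for one sentence."""
    hla: frozenset
    actions: frozenset
    diseases: frozenset


def is_correct(predicted: Summary, gold: Summary) -> bool:
    # Strict criterion: every entity and relation keyword must match.
    # Assumption: spurious extra items also count as an error.
    return predicted == gold


def strict_accuracy(pairs: Iterable[Tuple[Summary, Summary]]) -> float:
    """Fraction of (predicted, gold) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_correct(p, g) for p, g in pairs) / len(pairs)


if __name__ == "__main__":
    # Illustrative values only; not taken from the evaluation data.
    gold = Summary(frozenset({"A*0201"}), frozenset({"associated"}),
                   frozenset({"disease X"}))
    partial = Summary(frozenset({"A*0201"}), frozenset({"associated"}),
                      frozenset())  # the disease entity was missed
    print(strict_accuracy([(gold, gold), (partial, gold)]))
```

Under this criterion a summary that recovers the HLA entity and the action but
misses the disease is scored the same as one that recovers nothing.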
5 Results and Discussion
We collected 16,833 sentences containing both HLA and disease entities from
66,785 PubMed abstracts that contain HLA keywords, published from 1979 to
2004. We found that 6,654 sentences are in level 2 and 10,184 sentences are in
level 3. We therefore collected 10,184 items of HLA-disease interaction
information automatically from 66,785 PubMed abstracts and offer them, together
with their summary information, at our web site 1 .
The summary information consists of the HLA, Action and Disease keywords
found in the previous data mining process. The information is divided into
three categories: HLA, disease and geographic location. Each entity in each
category has HLA-disease interactions, and we counted the number of relevant
sentences for each entity, including the relevant sentences of its sub-entities.
For example, HLA-A has A*02(1974) as a sub-entity, where 1974 is the count of
relevant sentences for HLA-A*02. HLA-A*02 in turn has A*0201(296), A*0202(1),
A*0203(3), A*0205(3), A*0206(5), A*0207(12) and A*0211(2). The total count of
these sub-entities (322) does not match the count of A*02 (1974) because A*02
also has its own relevant sentences. Thus, the count of relevant sentences for
HLA-A*02 itself is 1652 (= 1974 - 322). We collected 23,438 sentences on HLA,
53,875 sentences on disease and 3,436 sentences on geographic location. Fig. 5 shows
1 HaDextract system (http://dataknow.korea.ac.kr/hadextract/index.php)