Information Technology Reference
incorporated only parts of the total process. In addition, the accuracy of HaDextract
was 89.6%, which is above the 70-80% average accuracy of other systems.
2 Entity Recognizing in HaDextract System
In order to extract HLA-disease interaction information, HLA and disease enti-
ties in the literature need to be recognized in advance. Entities in the biomedical
domain show various surface realizations, which are called term variants. To deal
with the term variants, this thesis builds regular expressions for HLA entities and uses
the MeSH ontology and disease abbreviation patterns.
Even though HLA and disease entities are essential components in interaction
information, coexistence of HLA and disease entities within the same sentence
does not always guarantee their relation. Lexical information in the sentence also
needs to be taken into account. We used 25 words as 'relation keywords' and 5 words as
'filtering keywords'. Relation keywords such as 'Associate' and 'Correlate' indicate a
relation between HLA and disease, and filtering keywords such as 'Restricted' and
'specific' are used to filter out sentences that do not contain HLA-disease interaction
information.
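The keyword step can be illustrated with a minimal Python sketch. It is not the HaDextract implementation: the keyword tuples hold only the four examples named above (the full lists of 25 and 5 words are not given here), the stem matching is an assumption, and the helper names `has_keyword` and `is_candidate` are hypothetical. The sketch also assumes the sentence is already known to contain both an HLA and a disease entity.

```python
import re

# Illustrative subsets only: the text names two keywords of each kind,
# not the full lists of 25 relation and 5 filtering keywords.
RELATION_STEMS = ("associat", "correlat")    # stems of 'Associate', 'Correlate'
FILTERING_STEMS = ("restricted", "specific")


def has_keyword(sentence, stems):
    """Return True if any word in the sentence starts with one of the stems."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(tok.startswith(stem) for tok in tokens for stem in stems)


def is_candidate(sentence):
    """Keep a sentence only if it has a relation keyword and no filtering keyword."""
    return has_keyword(sentence, RELATION_STEMS) and not has_keyword(
        sentence, FILTERING_STEMS
    )
```

Under these assumptions, a sentence such as "HLA-B27 is strongly associated with ankylosing spondylitis." is kept, while one containing "HLA-A2-restricted" is discarded even if a relation keyword is present.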
Geographic location entities were not necessary to extract interaction information,
but both were used to summarize interaction information. In addition,
we used abbreviations of disease entities found in literature to improve recall of
2.1 HLA Entity
There are three naming methods to indicate HLA entities: Antigen, Allele, and
Gene group. The same HLA can be written differently depending on its
naming method. Table 1 shows various instances named by serology antigen, DNA allele,
and group antigen.
Instances named by serology antigen and DNA allele are similar in expression.
For example, 'A2' in serology antigen naming appears as 'A*02' in DNA allele
naming. Both naming methods have in common that the designation (e.g., 'A2')
appears after the 'HLA-' keyword. We built the regular expressions in Table 2 to
find HLA entities named by the serology antigen and DNA allele methods.
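As a rough illustration of this idea, the Python sketch below matches the antigen and allele forms listed in Table 1. It is not the regular expression from Table 2, which is not reproduced in this section. The pattern, the locus list, and the function name `find_hla` are assumptions made for the example, and short antigen forms such as 'B12' or 'C3' can collide with unrelated terms in real text.

```python
import re

# Hypothetical pattern covering only the antigen and allele forms in Table 1.
HLA_PATTERN = re.compile(
    r"""
    (?<![A-Za-z0-9*])                  # do not start inside a longer token
    (?:HLA-|-)?                        # optional 'HLA-' keyword or bare hyphen
    (?P<locus>A|B|C|D[PQR])            # locus letters (assumed subset)
    (?:
        \*(?P<allele>\d{2,8}[A-Z]?)    # DNA allele: '*' plus 2 to 8 digits, optional suffix letter
      | (?P<antigen>w?\d{1,3})         # serology antigen: optional 'w' plus a number
    )
    (?![A-Za-z0-9*])                   # do not stop inside a longer token
    """,
    re.VERBOSE,
)


def find_hla(text):
    """Return (matched_text, naming_method) pairs for each HLA mention found."""
    results = []
    for m in HLA_PATTERN.finditer(text):
        method = "allele" if m.group("allele") else "antigen"
        results.append((m.group(0), method))
    return results
```

A single pattern is used here because the two naming methods share the same prefix and locus; only the part after the locus differs, which the named groups `allele` and `antigen` distinguish.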
Instances named by group antigen differ in expression from serology antigen
and DNA allele instances, since group antigen naming focuses on a
combination of alleles rather than on the specification of a single allele. We used simple keyword
matching (a dictionary-based approach) to find group antigens in the literature.
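A dictionary-based lookup of this kind might be sketched in Python as follows. The dictionary holds only the gene group terms listed in Table 1, and the boundary rules and the names `build_matcher` and `find_gene_groups` are assumptions for illustration, not the system's actual code.

```python
import re

# Illustrative dictionary: only the gene group terms that appear in Table 1.
GENE_GROUP_TERMS = ["A1CREG", "A2CREG", "Bw4CREG", "HLA-A*", "HLA-A", "A*"]


def build_matcher(terms):
    """Compile one alternation over all dictionary terms."""
    # Longest terms first, so that 'HLA-A*' is tried before 'HLA-A' and 'A*'.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # A term must not be glued to letters or digits on either side, and must
    # not be followed by '*', so allele mentions like 'HLA-A*0201' are skipped.
    return re.compile(rf"(?<![A-Za-z0-9])(?:{alternation})(?![A-Za-z0-9*])")


def find_gene_groups(text, matcher):
    """Return every dictionary term found in the text, in order of appearance."""
    return [m.group(0) for m in matcher.finditer(text)]


matcher = build_matcher(GENE_GROUP_TERMS)
```

Compiling the dictionary into one alternation keeps the lookup to a single pass over each sentence; a larger dictionary could instead use a set lookup over tokens.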
Table 1. Antigen, Allele, and Gene group

Naming method   Instances
Antigen         HLA-A2, A2, -A2, A2-transgenic, Bw4, Bw6, Cw1, DR1, DQ1
Allele          HLA-A*02, -A*02, A*0201, A*020101, A*02010101, A*02010102L,
Gene group      A1CREG, A2CREG, Bw4CREG, HLA-A, A*, HLA-A*