The IE module in our software system is similar to the systems described above. The
main difference is that in almost all of the projects mentioned above the extracted
information is presented to the user in the form of a structured representation. Our
application goes a step further in that the extracted information is used to
automatically construct a genetic network and thus contributes to the process of
transformation of information into knowledge. There is no direct interaction in which
the user visually checks the result. Information extraction
systems often sacrifice precision for recall, or vice versa. If a system is tuned to have
high recall, it often extracts more than it should (low precision). In our case, where
the system should give an interpretation of the data, this could lead to a wrong
hypothesis, which might then form the basis for further experiments. To avoid this
problem our system should favour high precision over high recall. It is more
tolerable to miss a relation than to indicate a wrong relationship.
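
The trade-off described above can be made concrete with a small calculation. The following is only an illustrative sketch: the gene names and relation sets are hypothetical placeholders, not results from our system. It compares a recall-oriented configuration with a precision-oriented one against a made-up gold standard.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted relations against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Convention assumed here: an empty set counts as perfect (1.0).
    precision = true_positives / len(extracted) if extracted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical gold-standard relations (placeholder gene names).
gold = {("geneA", "geneB"), ("geneB", "geneC"),
        ("geneC", "geneD"), ("geneA", "geneD")}

# Recall-oriented: finds every true relation but also asserts two wrong ones.
recall_tuned = gold | {("geneA", "geneC"), ("geneB", "geneD")}

# Precision-oriented: misses two relations but asserts nothing wrong.
precision_tuned = {("geneA", "geneB"), ("geneB", "geneC")}

for name, result in [("recall-tuned", recall_tuned),
                     ("precision-tuned", precision_tuned)]:
    p, r = precision_recall(result, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy setting the precision-oriented configuration adds no wrong edge to the network, at the cost of leaving half of the true relations undiscovered.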
The remainder of this section outlines the design of our IE module. The input to the IE module is
a list of gene identifiers, i.e. genes classified by the neural network as a typical gene
pattern. For each gene identifier a list of gene names and synonyms is generated using
a precompiled dictionary. This set of synonyms is used as a basis for the search in the
PubMed abstracts. Together with the synonyms a set of terms describing the causal
relationship of genes is used to further specify the query. These terms are listed
manually after analysing a set of relevant PubMed abstracts. For example the terms
'co-expressed' or 'co-regulated' are often used to describe the causal relationship we
are looking for. The result of the query is a set of abstracts which is downloaded for
further processing. The next step in the text analysis is to find and extract the
information needed. To fulfil these tasks we have to go through several phases which
contain more or less the tasks described above. The first phase can be described as a
text pre-processing phase. In this phase a tagger is used to attach annotations to each
word or symbol. These annotations are used in the subsequent steps to identify the
predefined entities like gene or protein names. Once these entities are recognized, a
set of rules is used to identify the relationship between the entities. In the last phase
the extracted information is filled into pre-specified templates which are then
transformed into database records. These database records are then used for the
generation of genetic networks.
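
The phases described above can be sketched in a few lines. This is a deliberately simplified illustration, not the implementation of our module: the dictionary contents, gene identifiers, query syntax and record fields are all hypothetical, and plain dictionary matching stands in for the tagger. The single rule used here accepts a relation only if two known genes and a relation term occur in the same sentence.

```python
import re
from itertools import combinations

# Hypothetical precompiled dictionary: gene identifier -> names and synonyms.
SYNONYMS = {
    "G1": ["geneA", "gA"],
    "G2": ["geneB", "gB"],
}
# Manually listed terms describing the relationship we are looking for.
RELATION_TERMS = ["co-expressed", "co-regulated"]


def build_query(gene_id):
    """Combine a gene's synonyms with the relation terms (assumed boolean syntax)."""
    names = " OR ".join(f'"{n}"' for n in SYNONYMS[gene_id])
    terms = " OR ".join(f'"{t}"' for t in RELATION_TERMS)
    return f"({names}) AND ({terms})"


def tag_entities(sentence):
    """Return the identifiers of all dictionary genes mentioned in a sentence."""
    found = []
    for gene_id, names in SYNONYMS.items():
        if any(re.search(rf"\b{re.escape(n)}\b", sentence) for n in names):
            found.append(gene_id)
    return found


def extract_relations(abstract):
    """Fill one template (a dict) per gene pair that satisfies the rule."""
    records = []
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        genes = tag_entities(sentence)
        term = next((t for t in RELATION_TERMS if t in sentence), None)
        if term is None:
            continue  # no relation term in this sentence: extract nothing
        for gene_1, gene_2 in combinations(genes, 2):
            records.append({"gene_1": gene_1, "gene_2": gene_2, "relation": term})
    return records


abstract = ("We found that geneA and gB are co-expressed in the embryo. "
            "geneB was also studied.")
print(build_query("G1"))
print(extract_relations(abstract))
```

Requiring the genes and the relation term to share a sentence is one way to favour precision: relations expressed across sentence boundaries are missed, but fewer wrong relations reach the database records.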
5 Visualization of Genetic Networks
Resulting genetic networks - consisting of a set of genes and causal relations between
them - are presented as a static 3D structure by a visualization tool developed in
Inprise Delphi using OpenGL. Genes are presented as globes
labelled, at the user's option, with expression labels or with identifiers of relevant
internet databases (the tripartite members GenBank, EMBL and DDBJ, or GeNet). Genes are
linked by arrows if they are related. In the future we will develop interactive components
for users to choose a set of related genes and zoom into the genetic network. First
results of utilizing several components of our software system separately are available.
Networks of Drosophila and Sea Urchin were obtained from information in the internet database GeNet.
information. Gene relation information for Drosophila and Sea Urchin are mined from