Genomic Data Explosion – The Challenge for Bioinformatics? - Advances in Data Mining


The IE module in our software system is similar to the systems described above. But

there is a main difference: in almost all of the projects mentioned above the extracted

information is presented to the user in the form of a structured representation. Our

application goes a step further in that the extracted information is used to

automatically construct a genetic network and thus contributes to the process of

transformation of information into knowledge. There is no direct interaction in which

the user visually checks the result. Information extraction

systems often sacrifice precision for recall, or vice versa. If a system is tuned for

high recall, it often extracts more than it should (low precision). In our case, where

the system is meant to interpret the data, this could lead to a wrong hypothesis

that then forms the basis for further experiments. To avoid this

problem, our system should favour high precision over high recall. It is more

tolerable to miss a relation than to report a wrong one.
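
The trade-off can be made concrete with a small calculation. The sketch below is only an illustration: the gene pairs and the names `precision_recall`, `gold`, `cautious` and `greedy` are invented here and do not come from our system. It compares a cautious extractor with a greedy one against a hand-checked set of relations.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted relations against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted relations that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of correct relations that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gene pairs, invented for illustration only.
gold = {("geneA", "geneB"), ("geneB", "geneC"),
        ("geneC", "geneD"), ("geneA", "geneD")}

# A cautious extractor reports few relations, all of them correct.
cautious = {("geneA", "geneB"), ("geneB", "geneC")}

# A greedy extractor finds every true relation but adds four wrong ones.
greedy = gold | {("geneA", "geneC"), ("geneB", "geneD"),
                 ("geneD", "geneE"), ("geneA", "geneE")}

print(precision_recall(cautious, gold))  # (1.0, 0.5)
print(precision_recall(greedy, gold))    # (0.5, 1.0)
```

In this toy case the greedy extractor would add four wrong edges to the network, whereas the cautious one only leaves two true edges out.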

The remainder of this section outlines the design of our IE module. The input to the IE module is

a list of gene identifiers, i.e. genes classified by the neural network as a typical gene

pattern. For each gene identifier a list of gene names and synonyms is generated using

a precompiled dictionary. This set of synonyms is used as a basis for the search in the

PubMed abstracts. Together with the synonyms, a set of terms describing the causal

relationship between genes is used to narrow the query. These terms are compiled

manually after analysing a set of relevant PubMed abstracts. For example, the terms

'co-expressed' or 'co-regulated' are often used to describe the causal relationship we

are looking for. The result of the query is a set of abstracts which is downloaded for

further processing. The next step in the text analysis is to find and extract the

information needed. This requires several phases, which broadly

correspond to the tasks described above. The first phase is a

text pre-processing phase. In this phase a tagger is used to attach annotations to each

word or symbol. These annotations are used in the subsequent steps to identify the

predefined entities like gene or protein names. Once these entities are recognized, a

set of rules is used to identify the relationship between the entities. In the last phase

the extracted information is filled into pre-specified templates which are then

transformed into database records. These database records are then used for the

generation of genetic networks.
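
The phases described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our implementation: the dictionary `SYNONYMS`, the gene names, the function names and the single extraction rule are all hypothetical, the query string is a generic boolean expression rather than confirmed PubMed syntax, and a plain dictionary lookup stands in for the tagger. Only the terms 'co-expressed' and 'co-regulated' are taken from the text.

```python
import re

# Hypothetical precompiled dictionary: gene identifier -> names and synonyms.
SYNONYMS = {
    "G1": ["geneA", "gA"],
    "G2": ["geneB", "gB"],
}
# Relation terms; these two are the examples named in the text.
RELATION_TERMS = ["co-expressed", "co-regulated"]


def build_query(gene_id):
    """Combine a gene's synonyms with the relation terms into one query."""
    names = " OR ".join(f'"{n}"' for n in SYNONYMS[gene_id])
    terms = " OR ".join(f'"{t}"' for t in RELATION_TERMS)
    return f"({names}) AND ({terms})"


def tag(sentence):
    """Pre-processing: annotate each token as GENE, REL or O (other)."""
    lookup = {name.lower(): gid
              for gid, names in SYNONYMS.items() for name in names}
    tagged = []
    for token in re.findall(r"[\w-]+", sentence):
        low = token.lower()
        if low in lookup:
            tagged.append((token, "GENE", lookup[low]))
        elif low in RELATION_TERMS:
            tagged.append((token, "REL", low))
        else:
            tagged.append((token, "O", None))
    return tagged


def extract(sentence):
    """Apply one strict rule and fill a template for each match.

    Assumed rule: the sentence must mention exactly two distinct genes
    and exactly one relation term. Everything else is skipped, which
    trades recall for precision.
    """
    tagged = tag(sentence)
    genes = sorted({value for _, label, value in tagged if label == "GENE"})
    rels = [value for _, label, value in tagged if label == "REL"]
    if len(genes) == 2 and len(rels) == 1:
        return [{"gene_1": genes[0], "gene_2": genes[1], "relation": rels[0]}]
    return []


def build_network(records):
    """Turn template records into an adjacency map (direction not modelled)."""
    network = {}
    for record in records:
        network.setdefault(record["gene_1"], set()).add(record["gene_2"])
        network.setdefault(record["gene_2"], set()).add(record["gene_1"])
    return network


records = extract("geneA and gB are co-expressed in the embryo.")
print(build_query("G1"))
print(records)
print(build_network(records))
```

The rule is deliberately narrow: a sentence that names three genes, or none of the relation terms, yields no record at all, in line with the preference for missing a relation over reporting a wrong one.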

5 Visualization of Genetic Networks

The resulting genetic networks, each consisting of a set of genes and the causal relations between

them, are presented as a static 3D structure by a visualization tool developed

in Inprise Delphi using OpenGL. Genes are shown as globes

labelled, at the user's option, with expression labels or with identifiers from relevant internet databases (the

tripartite members GenBank, EMBL and DDBJ, or GeNet). Related genes are

linked by arrows. In the future we will develop interactive components

that let users choose a set of related genes and zoom into the genetic network. First

results of utilizing several components of our software system separately are available.

We obtained networks of Drosophila and sea urchin from information in the internet database GeNet.

Gene relation information for Drosophila and sea urchin is mined from
