methods with graphical representation of the primary data. The aim is to give
scientists a representation of complex gene expression data that, through statistical
organization and graphical display, allows them to assimilate and explore the data in a
natural, intuitive manner.
Principal Component Analysis (PCA). One of the more demanding multivariate
methods of statistics is PCA, which simplifies highly complex data records by
reducing their dimensionality. In this sense it is a useful mathematical
framework for processing expression data. However, the results of this approach are
difficult to interpret. This is a generally known problem, and it is aggravated
in a new and highly complex research context such as the field of
genomics. Nevertheless, scientists from Stanford, for example, describe a method that
uses singular value decomposition (SVD) to transform genome-wide expression data from
genes × arrays space to a reduced space for processing and modelling [1].
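The dimension reduction described above can be illustrated with a minimal sketch. It is not the method of [1]; it only assumes that NumPy is available, that each gene (row) is centered before decomposition, and that the toy matrix below (made up for illustration) stands in for real expression data. The function name `svd_reduce` is hypothetical.

```python
import numpy as np

def svd_reduce(expression, k):
    """Project a genes x arrays matrix onto its k leading singular components."""
    # Assumption: center each gene (row) so that the components describe
    # variation across arrays rather than the mean expression level.
    centered = expression - expression.mean(axis=1, keepdims=True)
    # Thin SVD: centered = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    # Fraction of the total variance carried by each component.
    explained = s**2 / np.sum(s**2)
    # Coordinates of each gene in the reduced k-dimensional space.
    gene_scores = U[:, :k] * s[:k]
    return gene_scores, Vt[:k], explained

# Toy data (made up): 6 genes on 4 arrays, one dominant pattern plus small noise.
rng = np.random.default_rng(0)
pattern = np.array([1.0, 2.0, 3.0, 4.0])
loadings = np.array([2.0, -1.0, 0.5, 1.5, -2.0, 1.0])
expression = np.outer(loadings, pattern) + 0.01 * rng.normal(size=(6, 4))

gene_scores, components, explained = svd_reduce(expression, k=2)
print(gene_scores.shape, components.shape)
print(np.round(explained, 4))
```

Each row of `gene_scores` describes one gene by k numbers instead of one value per array. The difficulty of interpretation mentioned above remains: the components are linear mixtures of all arrays and need not correspond to any single experimental condition.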
Correlations between Genes. Finally, correlations are often discussed in the
context of gene expression analysis. Correlation is a general term
for interdependence between pairs of variables [14]. In practice this often means looking
at the scatter plot of, for instance, two experiments to detect a cigar-shaped cloud of
points, which suggests that the expression of the genes is linearly correlated. This is a
preliminary investigation that precedes a correlation analysis proper, which tests
whether a correlation exists. A metric has to be developed that describes the similarity
of two genes over a series of conditions. This could be a correlation coefficient such as
Pearson's, an index that quantifies the linear relationship between a
pair of variables, or a rank-based coefficient such as Kendall's. Furthermore, the
coefficient can be tested under distributional assumptions. The coefficient takes values
in the interval [-1, +1], with the sign indicating the direction of the relationship and
the numerical magnitude its strength. For Pearson's coefficient, values of -1 or +1
indicate that the sample values fall on a straight line, and a value of zero indicates
the lack of any linear relationship between the two variables. The result thus yields a
statement about the existence of the hypothesized correlation. Or, put more precisely: at
best, the test rejects the hypothesis that there is no correlation. And finally:
whether a correlation established in this way is indeed a causal one, as desired, is
another matter.
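As a minimal sketch of the coefficients mentioned above, the following standard-library Python code computes Pearson's coefficient, Kendall's tau (the simple tau-a variant without tie correction), and the usual t statistic for testing the hypothesis of no linear correlation. The function names and the toy expression profiles are assumptions made for illustration, and the t statistic assumes bivariate normal data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / all pairs, no tie correction."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)

def t_statistic(r, n):
    """t statistic for H0 'no linear correlation' with n - 2 degrees of freedom.

    Assumes bivariate normal data and |r| < 1.
    """
    return r * math.sqrt((n - 2) / (1 - r * r))

# Toy profiles (made up): two genes measured under five conditions.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.1, 3.9, 6.2, 8.0, 9.8]

r = pearson_r(gene_a, gene_b)
tau = kendall_tau(gene_a, gene_b)
t = t_statistic(r, len(gene_a))
print(r, tau, t)
```

A large absolute t value speaks against the hypothesis of no correlation; it says nothing about whether the relationship between the two genes is causal.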
In this section we gave a short overview of several, mostly statistical, methods
commonly applied to mine genomic expression data. In the next section we
introduce a complete software system architecture that is able to model
genetic networks using the results of micro array expression data analysis.
3 A System Architecture for Mining Micro Array Expression Data
Mining genomic data obtained in the labs of physicians or biologists is a highly topical
research task. Equally frequent, however, are applications in which text
information from journals, itself the result of mining lab data, is to be mined in turn.
In this context we often find methods such as parsing text documents for
special words or phrases, information retrieval for collecting data
and information, and information extraction for converting unstructured text
into structures that can be stored in a database.
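The step from unstructured text to database records can be sketched as follows. This is only an illustration under strong assumptions: the regular expression, the table layout, the function names, and the example sentence are all hypothetical, and real information extraction from journal text needs far more robust language processing.

```python
import re
import sqlite3

# Hypothetical pattern: gene symbol, regulation verb, gene symbol.
RELATION = re.compile(
    r"\b([A-Z][A-Z0-9]{1,7})\s+(activates|represses|inhibits)\s+([A-Z][A-Z0-9]{1,7})\b"
)

def extract_relations(text):
    """Return (source, verb, target) tuples found in free text."""
    return RELATION.findall(text)

def store_relations(conn, relations):
    """Store extracted tuples in a simple relational table (assumed layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS relation (source TEXT, verb TEXT, target TEXT)"
    )
    conn.executemany("INSERT INTO relation VALUES (?, ?, ?)", relations)
    conn.commit()

# Made-up example sentence standing in for a journal abstract.
abstract = "We found that GAL4 activates GAL1 in yeast, whereas MIG1 represses GAL4."

relations = extract_relations(abstract)
conn = sqlite3.connect(":memory:")
store_relations(conn, relations)
rows = conn.execute(
    "SELECT source, verb, target FROM relation ORDER BY source"
).fetchall()
print(rows)
```

Once the relations are stored in structured form, they can be queried and combined with the results of the expression data analysis.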
But mining information and putting it into a structured form for storage in a
database is not the whole task. Our intention goes a step further: we want to mine data with the