A Data Warehousing Approach for Genomics Data Meta-Analysis - Evolving Application Domains of Data Warehousing and Mining - page 146


downloaded from public repositories or may be the result of local experiments. For instance, Table 9 gives an excerpt of ANOVA results for the search for differentially expressed genes between gene expressions at 7h and at 0h under Control conditions in the GSE6281 series described previously. The first three columns give information on the gene probeset, the fourth column gives the intensity ratio between conditions, the fifth column gives the p-value, and the last column gives the differential expression level (+ or - for up- or down-regulation) deduced from a threshold.

In AMI, we keep only synthetic data derived from p-values and fold changes, which indicate up- or down-regulation of genes between two conditions. These data are stored in relational tables, like the raw expression intensity data.
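As a rough illustration of this relational storage, the following minimal sketch loads the two rows shown in Table 9 into an in-memory SQLite table and retrieves probes below a p-value cutoff. The table name, column names, and the helper `significant_probes` are hypothetical and are not taken from the AMI schema; SQLite merely stands in for whatever relational engine the warehouse uses.

```python
import sqlite3

# Hypothetical schema: one row per probe and per pair of compared conditions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE differential_expression (
        probe       TEXT NOT NULL,
        symbol      TEXT,
        series      TEXT NOT NULL,
        condition_a TEXT NOT NULL,
        condition_b TEXT NOT NULL,
        fold_change REAL,
        p_value     REAL,
        level       TEXT CHECK (level IN ('+', '-')),
        PRIMARY KEY (probe, series, condition_a, condition_b)
    )
    """
)

# Values copied from the excerpt of Table 9 (GSE6281, 7h versus 0h).
rows = [
    ("240717_at", "ABCB5", "GSE6281", "7h", "0h", 0.5235, 0.01219, "+"),
    ("232081_at", "ABCG1", "GSE6281", "7h", "0h", 1.6253, 0.00124, "-"),
]
conn.executemany(
    "INSERT INTO differential_expression VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
)


def significant_probes(connection, alpha):
    """Return (probe, level) pairs with p_value below alpha, most significant first."""
    cursor = connection.execute(
        "SELECT probe, level FROM differential_expression "
        "WHERE p_value < ? ORDER BY p_value",
        (alpha,),
    )
    return cursor.fetchall()


strict = significant_probes(conn, 0.01)
relaxed = significant_probes(conn, 0.05)
```

Keeping only the fold change, the p-value, and the derived level for each probe keeps such a table compact while still allowing retrieval by significance.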

Representation of Statistical Results

As stated in the AMI overview, another main goal in designing the semantic AMI data warehouse is to retain synthetic data produced by statistical analyses, in a way that facilitates information retrieval on fuzzy and semantic criteria. In a first stage, we plan to consider two kinds of statistical and data mining results: differential gene expression between two given conditions and clustering models on gene expression intensities. Differential gene expression results are stored in relational tables and clustering models are stored as XML representations (represented by (5) in Figure 1); each is discussed in the following paragraphs.
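To make the idea of an XML representation of a clustering model more concrete, here is a minimal sketch that reads cluster centres from a small document written in PMML, the XML-based model format of the Data Mining Group. It uses only standard PMML `ClusteringModel` elements; the extensions defined for AMI are not reproduced here. The field names, cluster names, centre values, and the helpers `read_cluster_centers` and `nearest_cluster` are all hypothetical.

```python
import xml.etree.ElementTree as ET

PMML_NS = {"p": "http://www.dmg.org/PMML-4_4"}

# Hypothetical model: two clusters over two expression conditions.
PMML_DOC = """
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="hypothetical clustering of expression intensities"/>
  <DataDictionary numberOfFields="2">
    <DataField name="cond_0h" optype="continuous" dataType="double"/>
    <DataField name="cond_7h" optype="continuous" dataType="double"/>
  </DataDictionary>
  <ClusteringModel modelName="demo" functionName="clustering"
                   modelClass="centerBased" numberOfClusters="2">
    <MiningSchema>
      <MiningField name="cond_0h"/>
      <MiningField name="cond_7h"/>
    </MiningSchema>
    <ComparisonMeasure kind="distance"><squaredEuclidean/></ComparisonMeasure>
    <ClusteringField field="cond_0h"/>
    <ClusteringField field="cond_7h"/>
    <Cluster name="c1"><Array n="2" type="real">1.0 2.0</Array></Cluster>
    <Cluster name="c2"><Array n="2" type="real">5.0 6.0</Array></Cluster>
  </ClusteringModel>
</PMML>
"""


def read_cluster_centers(pmml_text):
    """Map each cluster name to its centre coordinates."""
    root = ET.fromstring(pmml_text.strip())
    model = root.find("p:ClusteringModel", PMML_NS)
    centers = {}
    for cluster in model.findall("p:Cluster", PMML_NS):
        values = cluster.find("p:Array", PMML_NS).text.split()
        centers[cluster.get("name")] = [float(v) for v in values]
    return centers


def nearest_cluster(centers, profile):
    """Name of the cluster whose centre has the smallest squared Euclidean distance."""
    def squared_distance(center):
        return sum((c - x) ** 2 for c, x in zip(center, profile))
    return min(centers, key=lambda name: squared_distance(centers[name]))


centers = read_cluster_centers(PMML_DOC)
assigned = nearest_cluster(centers, [4.0, 5.0])
```

Because the model is plain XML, it can be stored and exchanged as a document while remaining queryable with standard XML tooling.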

Data Mining Models of Gene Expressions

For storage and intelligent retrieval of data mining models, standard representation formats like XML and PMML and semantic annotation formats like RDF are well suited to AMI requirements. The Predictive Model Markup Language (PMML) is an XML-based language that provides a way for applications to define statistical and data mining models and to share models between PMML-compliant applications. It was defined by the Data Mining Group (footnote 18). In this section, we present the PMML extensions we have defined for clustering models. RDF annotations are detailed in the next sections. A PMML clustering model as illustrated by Figure

Differentially Expressed Genes

While gene expression data tables have many more rows (genes) than columns (conditions), for instance 50000 genes for 50 conditions, analyses of pairwise differential expression among conditions also produce a huge amount of resulting data. The search for differentially expressed genes is frequently performed with one-way or two-way ANOVA algorithms. As presented previously (see section “MICRO-ARRAYS EXPERIMENTS”), an ANOVA method provides its results as a list of genes with their p-values over two conditions.
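The scale mentioned above can be checked with a short calculation, and the statistic behind a one-way ANOVA can be sketched in a few lines. The example below assumes that "pairwise" means unordered pairs of conditions, and the replicate values passed to `one_way_anova_f` are hypothetical. It computes only the F statistic; turning it into a p-value requires the F distribution, which is normally taken from a statistics library and is not shown here.

```python
from math import comb


def pairwise_result_count(n_genes, n_conditions):
    """Number of (gene, condition pair) results, counting unordered pairs."""
    return n_genes * comb(n_conditions, 2)


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA; each group holds replicate intensities."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the replicates around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# 50 conditions give 1225 unordered pairs, hence 61,250,000 results for 50000 genes.
total_results = pairwise_result_count(50000, 50)

# Hypothetical replicates for one gene under two conditions.
f_value = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
```

With tens of millions of pairwise results, storing only a compact summary per gene and per pair of conditions becomes a practical necessity.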

Table 9. Example of results after an ANOVA on gene expression

Probe       Symbol   Description                                    Fold change 7h/0h   p-value 7h/0h   Exp
240717_at   ABCB5    ATP-binding cassette sub-family B (MDR/TAP)    0.5235              0.01219         +
232081_at   ABCG1    ATP-binding cassette sub-family G (WHITE)      1.6253              0.00124         -
...         ...      ...                                            ...                 ...             ...
