Characterizing Genes by Marginal Expression Distribution - Advances in Computational Science and Engineering - page 167


Table 1. The parameter pairs used as components of the simulated data-generating mixture models

Parameters   1   2   3   4
θ_A          0.10.1510
θ_B          71436

parameter pairs shown in Table 1 was tried, yielding 4 mixture models each for one and three components and $6 = \binom{4}{2}$ mixture models with two components. The mixing weights ($P(j)$ in equation 1) were set to the uniform distribution in each case. To aid comparison with the real data described below, we simulated datasets with sizes 122 and 158.
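The model counts above can be checked with a short enumeration. This is a minimal sketch, assuming that each model uses distinct parameter pairs from Table 1 (which is what the count of 6 two-component models implies); the identifiers `PAIR_IDS`, `SAMPLE_SIZES`, and `enumerate_models` are hypothetical names introduced for illustration, and the pairs are referred to only by their column labels.

```python
from itertools import combinations

PAIR_IDS = (1, 2, 3, 4)      # column labels of the parameter pairs in Table 1
SAMPLE_SIZES = (122, 158)    # simulated dataset sizes stated in the text


def enumerate_models(pair_ids=PAIR_IDS, max_components=3):
    """List every mixture model as (component ids, uniform mixing weights)."""
    models = []
    for k in range(1, max_components + 1):
        for combo in combinations(pair_ids, k):
            weights = [1.0 / k] * k  # uniform P(j), as described in the text
            models.append((combo, weights))
    return models


models = enumerate_models()

# Count the models by number of components.
counts = {}
for combo, _ in models:
    counts[len(combo)] = counts.get(len(combo), 0) + 1

print(counts)  # {1: 4, 2: 6, 3: 4}
```

Each of these models would then be sampled at both sizes in `SAMPLE_SIZES`.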

Figures 1 and 2 indicate the observed effectiveness of information criteria at inferring the number of components in the generating model from the generated data. BIC outperforms AIC, correctly inferring the number of components in about 80% of the trials. Even for BIC, the error is slightly skewed toward overpredicting the number of components. This suggests it may be possible to further optimize the criteria for this task, but we did not pursue this possibility.
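To make the comparison concrete, the sketch below scores candidate models with the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L and picks the lowest score. It rests on assumptions not stated in the text: each component has two parameters and the mixing weights are counted as free parameters, giving k = 3m - 1 for m components. The log-likelihood values are made up purely for illustration and are not results from the study.

```python
import math


def n_free_params(n_components, params_per_component=2):
    """Assumed count: component parameters plus (m - 1) free mixing weights."""
    return n_components * params_per_component + (n_components - 1)


def aic(loglik, k):
    return 2.0 * k - 2.0 * loglik


def bic(loglik, k, n):
    return k * math.log(n) - 2.0 * loglik


def select_n_components(logliks, n, criterion="bic"):
    """Return the number of components with the lowest criterion value."""
    scores = {}
    for m, ll in logliks.items():
        k = n_free_params(m)
        scores[m] = bic(ll, k, n) if criterion == "bic" else aic(ll, k)
    return min(scores, key=scores.get)


# Hypothetical maximized log-likelihoods for 1, 2 and 3 components.
example_logliks = {1: -230.0, 2: -210.0, 3: -206.5}

print(select_n_components(example_logliks, n=158, criterion="aic"))  # 3
print(select_n_components(example_logliks, n=158, criterion="bic"))  # 2
```

In this made-up case the heavier ln n penalty of BIC rejects the third component that AIC accepts, which is the kind of overprediction the paragraph above describes.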

3.2 Real Dataset

As a preliminary study, we investigated gene expression datasets from human (GDS596) and mouse (GDS592), obtained from the GEO gene expression data repository [12] ( http://www.ncbi.nlm.nih.gov/geo/ ). GDS596 contains data from a study profiling 158 types of normal human tissue (22,283 probes), and GDS592 contains data from 122 types of mouse tissue (31,373 probes) [13].

Likelihood and K-S Test Comparison. We evaluate the likelihood of mixture models built from three types of distributions on the real datasets. Furthermore, we evaluate the goodness of fit of each model using the Kolmogorov-Smirnov (K-S) test, reported as D, the maximum discrepancy between the empirical and fitted cumulative probability distributions, and a p-value.
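The K-S quantities can be illustrated with a small standard-library sketch. It assumes a lognormal mixture as the fitted model (one of the three families compared here) and uses the asymptotic Kolmogorov series for the p-value, which is only an approximation for finite samples; the text does not say how the p-value was computed. All function names are hypothetical.

```python
import math


def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal distribution with log-scale mean mu and sd sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def mixture_cdf(x, weights, params):
    """CDF of a mixture: weighted sum of the component CDFs."""
    return sum(w * lognormal_cdf(x, mu, s) for w, (mu, s) in zip(weights, params))


def ks_statistic(sample, cdf):
    """D: largest gap between the empirical CDF and the model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_pvalue_asymptotic(d, n, terms=100):
    """Asymptotic p-value Q(lam) = 2 * sum_k (-1)^(k-1) exp(-2 k^2 lam^2)."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the series is numerically 1 in this range
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))
```

For a probe with expression values `values`, a call such as `ks_statistic(values, lambda x: mixture_cdf(x, weights, params))` would give D for a fitted mixture with the given `weights` and `params`.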

The goodness-of-fit p-values indicate that the gamma mixtures can fit the marginal distribution of gene expression reasonably well. However, lognormal mixtures fit better than gamma mixtures. Over all experiments they obtain a higher likelihood than the gamma mixtures. The K-S test also confirms this

Table 2. Log-likelihood, D and p-value, averaged over each probe of the GDS596 dataset

        Normal                        Lognormal                     Gamma
#Comp   Loglik     D        p-value   Loglik     D        p-value   Loglik     D        p-value
1       -1063.02   1.04e-3  0.99      -212.85    2.30e-4  0.99      -968.59    3.00e-3  0.99
2       -979.13    1.74e-3  0.99      -205.58    4.55e-4  0.99      -955.11    1.60e-3  0.99
3       -952.29    2.72e-3  0.99      -204.55    7.04e-4  0.99      -963.20    1.19e-3  0.99
4       -913.66    4.63e-2  0.92      -203.05    8.27e-4  0.99      -967.20    9.88e-4  0.99
5       -881.67    1.55e-2  0.90      -201.78    1.26e-3  0.99      -968.56    5.77e-4  0.99
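The Loglik columns in Table 2 report mixture log-likelihoods. As a minimal sketch of how such a value can be computed for one probe, the code below evaluates the log-likelihood of a lognormal mixture with a log-sum-exp step for numerical stability. The function names and the parameterization (log-scale mean and standard deviation per component) are assumptions for illustration; the fitting procedure itself is not shown.

```python
import math


def lognormal_logpdf(x, mu, sigma):
    """Log density of a lognormal distribution at x > 0."""
    z = (math.log(x) - mu) / sigma
    return -math.log(x * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def mixture_loglik(sample, weights, params):
    """Sum over observations of log( sum_j P(j) * f_j(x) )."""
    total = 0.0
    for x in sample:
        logs = [math.log(w) + lognormal_logpdf(x, mu, s)
                for w, (mu, s) in zip(weights, params)]
        m = max(logs)
        # log-sum-exp: subtract the maximum before exponentiating
        total += m + math.log(sum(math.exp(v - m) for v in logs))
    return total
```

With a single component and weight 1.0, `mixture_loglik` reduces to the sum of `lognormal_logpdf` over the sample.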
