Table 1. The parameter pairs used as components of the mixture models that
generate the simulated data

Parameters   1 2 3 4
θ_A          0.10.1510
θ_B          71436
parameter pairs shown in Table 1 was tried, yielding 4 mixture models each with
one and with three components, and (4 choose 2) = 6 mixture models with two
components. The mixing proportions ( P ( j ) in equation 1) were set to the
uniform distribution in each case. To aid comparison with the real data
described below, we simulated datasets with sizes 122 and 158.
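
The enumeration and sampling described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the parameter pairs are hypothetical placeholders (not the values of Table 1), the function names are invented for this sketch, and the components are assumed to be gamma distributions with shape θ_A and scale θ_B, which this section does not state.

```python
import itertools
import random

# Hypothetical placeholder (theta_A, theta_B) pairs; NOT the values of Table 1.
PARAM_PAIRS = [(1.0, 2.0), (2.0, 2.0), (3.0, 1.0), (5.0, 0.5)]


def enumerate_models(pairs, sizes=(1, 2, 3)):
    """Return every subset of `pairs` having one of the given sizes."""
    return [combo
            for k in sizes
            for combo in itertools.combinations(pairs, k)]


def sample_mixture(components, n, rng):
    """Draw n points from a mixture with uniform mixing proportions.

    Assumption: each component is a gamma distribution with
    shape theta_A and scale theta_B.
    """
    data = []
    for _ in range(n):
        shape, scale = rng.choice(components)  # uniform P(j)
        data.append(rng.gammavariate(shape, scale))
    return data


models = enumerate_models(PARAM_PAIRS)  # 4 + 6 + 4 candidate models
rng = random.Random(0)                  # fixed seed for reproducibility
datasets = {}
for index, components in enumerate(models):
    for n in (122, 158):                # sizes matching the real datasets
        datasets[(index, n)] = sample_mixture(components, n, rng)
```

Choosing a component uniformly at random for each draw is what setting P ( j ) to the uniform distribution amounts to when sampling.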
Figures 1 and 2 show how effectively the information criteria induce the
number of components in the generating model from the generated data. BIC
outperforms AIC, correctly inducing the number of components in about 80% of
the trials. Even for BIC, the errors are slightly skewed toward overpredicting
the number of components. This suggests it may be possible to further optimize
the criteria for this task, but we did not pursue this possibility.
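
As an illustration of how such criteria select a component count, the sketch below uses the standard definitions AIC = 2k - 2 log L and BIC = k ln n - 2 log L. The parameter count (two parameters per component plus m - 1 free mixing proportions) is an assumption of this sketch, and the log-likelihood values in the example are hypothetical, not results from the study.

```python
import math


def aic(loglik, k):
    """Akaike information criterion (lower is better)."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion (lower is better)."""
    return k * math.log(n) - 2 * loglik


def n_free_params(m, params_per_component=2):
    """Assumed parameter count: component parameters plus m - 1 mixing weights."""
    return m * params_per_component + (m - 1)


def select_components(logliks, n, criterion="bic"):
    """Pick the component count with the lowest criterion value.

    logliks maps a number of components to its maximized log-likelihood.
    """
    scores = {}
    for m, loglik in logliks.items():
        k = n_free_params(m)
        scores[m] = bic(loglik, k, n) if criterion == "bic" else aic(loglik, k)
    return min(scores, key=scores.get)


# Hypothetical log-likelihoods for 1, 2 and 3 components on n = 158 points.
example = {1: -250.0, 2: -240.0, 3: -236.0}
chosen_by_bic = select_components(example, n=158, criterion="bic")
chosen_by_aic = select_components(example, n=158, criterion="aic")
```

For these hypothetical values BIC selects two components and AIC selects three: the per-parameter penalty in BIC, ln n, exceeds the constant 2 used by AIC once n is larger than about 7, so BIC is less prone to picking larger models.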
3.2 Real Dataset
As a preliminary study, we investigated the human (GDS596) and mouse (GDS592)
gene expression datasets from the GEO gene expression data repository [12]
( http://www.ncbi.nlm.nih.gov/geo/ ). GDS596 contains data from a study
profiling 158 types of normal human tissue (22,283 probes), and GDS592 contains
data from 122 types of mouse tissue (31,373 probes) [13].
Likelihood and K-S Test Comparison. We evaluate the likelihood of mixture
models from three types of distributions on the real datasets. Furthermore,
we evaluate the goodness of fit of each model using the Kolmogorov-Smirnov
(K-S) test, reported as D , the maximum discrepancy in the cumulative
probability distribution, and a p-value statistic.
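
A minimal sketch of the K-S statistic D for a sample against a fitted mixture is given below, assuming lognormal components. The function names are illustrative, and the p-value uses the asymptotic Kolmogorov series, which is an approximation and may differ from the computation used in the study.

```python
import math


def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal distribution with log-mean mu and log-sd sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def mixture_cdf(x, components, weights):
    """CDF of a mixture: weighted sum of the component CDFs."""
    return sum(w * lognormal_cdf(x, mu, sigma)
               for w, (mu, sigma) in zip(weights, components))


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and the model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue_asymptotic(d, n, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    t = math.sqrt(n) * d
    if t < 0.2:  # the series converges slowly here; the p-value is ~1
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))
```

A small D together with a large p-value means the fitted cumulative distribution stays close to the empirical one, which is how the D and p-value columns are read.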
The goodness-of-fit p-values indicate that the gamma mixtures can fit the
marginal distribution of gene expression reasonably well. However, lognormal
mixtures fit better than gamma mixtures: over all experiments they obtain
a higher likelihood than the gamma mixtures. The K-S test also confirms this
Table 2. Log-likelihood, D and p-value, averaged over each probe of the GDS596 dataset

        Normal                       Lognormal                    Gamma
#Comp   Loglik    D        p-value   Loglik    D        p-value   Loglik    D        p-value
1       -1063.02  1.04e-3  0.99      -212.85   2.30e-4  0.99      -968.59   3.00e-3  0.99
2       -979.13   1.74e-3  0.99      -205.58   4.55e-4  0.99      -955.11   1.60e-3  0.99
3       -952.29   2.72e-3  0.99      -204.55   7.04e-4  0.99      -963.20   1.19e-3  0.99
4       -913.66   4.63e-2  0.92      -203.05   8.27e-4  0.99      -967.20   9.88e-4  0.99
5       -881.67   1.55e-2  0.90      -201.78   1.26e-3  0.99      -968.56   5.77e-4  0.99
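
For reference, the log-likelihood of one probe's data under a mixture can be computed as sketched below. This is a minimal example under stated assumptions: lognormal components parameterized by log-mean and log-standard deviation, the standard mixture density (a weighted sum of component densities), and illustrative function names. It does not reproduce the fitting procedure behind Table 2.

```python
import math


def lognormal_logpdf(x, mu, sigma):
    """Log-density of a lognormal distribution at x > 0."""
    z = (math.log(x) - mu) / sigma
    return -math.log(x * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def mixture_loglik(data, components, weights):
    """Sum over data points of log( sum_j P(j) * p(x | theta_j) )."""
    total = 0.0
    for x in data:
        terms = [math.log(w) + lognormal_logpdf(x, mu, sigma)
                 for w, (mu, sigma) in zip(weights, components)]
        # Log-sum-exp keeps the computation stable for tiny densities.
        m = max(terms)
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total
```

Working in log space avoids underflow when a data point lies far in the tail of every component.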