levels in gene probes. To our knowledge this is the first study that provides such
a framework for analyzing expression data.
Although gamma distributions are theoretically capable of modeling skewed
distributions, our experiments showed that the lognormal distribution appears
to be more suitable for modeling the marginal distribution of gene expression.
We also showed that, of the two model selection criteria we used, BIC is more
accurate in selecting the number of components for lognormal and gamma
mixtures. AIC, on the other hand, tends to overestimate the number of
components.
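The difference between the two criteria can be illustrated with a minimal sketch. It assumes a lognormal mixture is fitted by running EM for a Gaussian mixture on log-transformed values, and it counts 3k - 1 free parameters for k components. The function names, the synthetic data, and the EM settings (iteration count, variance floor, initialization) are illustrative assumptions and not the procedure used in this study.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def normal_logpdf(y, mu, sigma):
    """Log density of a normal distribution with mean mu and std sigma."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)


def fit_lognormal_mixture(x, k, iters=100, seed=0):
    """Fit a k-component lognormal mixture by EM; return the log-likelihood of x."""
    y = [math.log(v) for v in x]  # lognormal in x is normal in log(x)
    n = len(y)
    rng = random.Random(seed)
    mean_y = sum(y) / n
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    mus = rng.sample(y, k)        # initial means: k random data points
    sigmas = [sd_y] * k
    weights = [1.0 / k] * k

    def component_logs(v):
        return [math.log(weights[j]) + normal_logpdf(v, mus[j], sigmas[j])
                for j in range(k)]

    for _ in range(iters):
        # E-step: responsibility of each component for each observation
        resp = []
        for v in y:
            logs = component_logs(v)
            total = log_sum_exp(logs)
            resp.append([math.exp(l - total) for l in logs])
        # M-step: re-estimate weights, means, and standard deviations
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            mus[j] = sum(r[j] * v for r, v in zip(resp, y)) / nj
            var = sum(r[j] * (v - mus[j]) ** 2 for r, v in zip(resp, y)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor avoids collapse
            weights[j] = nj / n

    # The Jacobian of the log transform contributes -sum(log x)
    return sum(log_sum_exp(component_logs(v)) for v in y) - sum(y)


def select_components(x, max_k=3):
    """Return (best k by AIC, best k by BIC, all scores)."""
    n = len(x)
    scores = {}
    for k in range(1, max_k + 1):
        ll = fit_lognormal_mixture(x, k)
        p = 3 * k - 1  # k means + k sigmas + (k - 1) free weights
        scores[k] = {"aic": 2 * p - 2 * ll, "bic": p * math.log(n) - 2 * ll}
    best_aic = min(scores, key=lambda k: scores[k]["aic"])
    best_bic = min(scores, key=lambda k: scores[k]["bic"])
    return best_aic, best_bic, scores


# Synthetic two-mode expression values (hypothetical, for illustration only)
data_rng = random.Random(1)
x = ([data_rng.lognormvariate(1.0, 0.3) for _ in range(150)]
     + [data_rng.lognormvariate(3.0, 0.3) for _ in range(150)])
best_aic, best_bic, scores = select_components(x, max_k=3)
print("AIC selects", best_aic, "components; BIC selects", best_bic)
```

Because the BIC penalty per parameter, ln n, exceeds the AIC penalty of 2 whenever n > e^2 (about 7.4 observations), BIC computed from the same fitted likelihoods can never select more components than AIC. This is consistent with the tendency of AIC to overestimate reported above.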
We hypothesized that different functional categories of genes (e.g.,
transcription factors, kinases, and structural proteins) may show similar
marginal distributions. Unfortunately, this expectation is not clearly
supported by our study. Only the single, vague gene ontology term intracellular
was found to be over-represented in both datasets. We believe follow-up
experiments are necessary to determine whether this is due to the quantity or
quality of the expression data used, a deficiency in our methodology, or
whether our hypothesis is simply wrong.
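For readers unfamiliar with over-representation analysis, the following minimal sketch shows one common way to score a gene ontology term: a one-sided hypergeometric test. The text does not state which test was used here, so the choice of test, the function name, and all counts below are assumptions made purely for illustration.

```python
from math import comb


def enrichment_pvalue(population, annotated, sample, hits):
    """One-sided hypergeometric p-value P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes carrying the GO term
    sample:     genes in the group of interest (e.g., one cluster)
    hits:       genes in the group carrying the GO term
    """
    upper = min(annotated, sample)
    # math.comb returns 0 when k > n, so impossible outcomes drop out
    tail = sum(comb(annotated, i) * comb(population - annotated, sample - i)
               for i in range(hits, upper + 1))
    return tail / comb(population, sample)


# Hypothetical counts: 2000 background genes, 300 annotated with the term,
# a group of 50 genes of which 15 carry the term.
p_value = enrichment_pvalue(2000, 300, 50, 15)
print("p-value:", p_value)
```

A small p-value indicates that the term occurs in the group more often than expected by chance. When many terms are tested, the p-values would also need a multiple-testing correction.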
To achieve more definitive results, we are now preparing to analyze a much
larger dataset comprising multiple GEO datasets. This will be essential to
sample the expression probes at the resolution needed to accurately model
multimodal marginal distributions. Our results should provide some guidance in
the development of informed priors or gene-specific normalization for use with
gene network