that is often skewed [3]. Additionally, we use the standard normal distribution as a control experiment against which we compare how well the gamma and lognormal mixture models perform.
Our choice of the gamma distribution is motivated by its flexible shape; furthermore, it has been used successfully in many studies of biological systems [4,5,6]. With regard to the lognormal distribution, there is strong evidence that it appears in many biological phenomena [7]. In practice it is also convenient for analyzing microarray data because it makes calculations easy and allows the data to be expressed as z-scores, a possible common unit for data comparison [8]. Below we describe the details of our methods and experimental results.
2 Methods
2.1 Statistical Model
Let $\{x_i\}, i = 1, \ldots, N$ denote the expression values of a gene probe, where $N$ is the total number of observations (samples). Under a mixture model, the probability density function for observing the finite data points $x_i$ is:
$$p(x) = \sum_{j=1}^{K} p(x \mid j)\, P(j) \qquad (1)$$
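For illustration, here is a minimal R sketch of the mixture density in Eq. (1), using gamma components as an example. The function name mixture_density and the example parameter values are our own placeholders, not from the study:

# Mixture density of Eq. (1): p(x) = sum_j p(x|j) P(j), with gamma components.
# 'shape', 'rate', and 'pi' (the priors P(j)) are illustrative placeholders.
mixture_density <- function(x, shape, rate, pi) {
  dens <- vapply(seq_along(pi),
                 function(j) pi[j] * dgamma(x, shape = shape[j], rate = rate[j]),
                 numeric(length(x)))
  rowSums(matrix(dens, nrow = length(x)))
}

# Example: a two-component gamma mixture whose priors sum to 1
mixture_density(c(0.5, 1, 2), shape = c(2, 5), rate = c(1, 0.5), pi = c(0.4, 0.6))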
The density function for each component is denoted $p(x \mid j)$; in the appendix we give the formal descriptions of the density functions for the three types of distributions used in our model. $P(j)$ denotes the prior probability that a data point was generated from component $j$ of the mixture. These priors are chosen to satisfy the constraint $\sum_{j=1}^{K} P(j) = 1$. The negative log-likelihood function of the data is given by:
$$LL = -\log L = -\sum_{i=1}^{N} \log \sum_{j=1}^{K} p(x_i \mid j)\, P(j) \qquad (2)$$
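Continuing the sketch above, Eq. (2) then becomes a one-line negative log-likelihood (reusing the hypothetical mixture_density() from before):

# Negative log-likelihood of Eq. (2): -sum_i log sum_j p(x_i|j) P(j)
neg_log_lik <- function(x, shape, rate, pi) {
  -sum(log(mixture_density(x, shape, rate, pi)))
}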
We use the expectation-maximization (EM) algorithm [9] to learn mixture models of the normal, lognormal, and gamma distributions for each probe's expression level. It is implemented in the R programming language. The EM algorithm iteratively maximizes the log-likelihood and updates the conditional probability that $x$ comes from the $k$-th component. This is defined as

$$p(x \mid j)^{*} = E[\,p(x \mid j) \mid x, \theta_{A_1}, \theta_{B_1}, \ldots, \theta_{A_K}, \theta_{B_K}\,] \qquad (3)$$
The set of parameters $[\theta_{A_1}, \theta_{B_1}, \ldots, \theta_{A_K}, \theta_{B_K}]$ is a maximizer of the log-likelihood for given $p(x \mid j)$. The EM algorithm iterates between an E-step, in which the values $p(x \mid j)^{*}$ are computed from the current parameter estimates, and an M-step, in which the log-likelihood, with each $p(x \mid j)$ replaced by its current conditional expectation $p(x \mid j)^{*}$, is maximized with respect to the parameters $\theta_A$ and $\theta_B$. These two steps are repeated until convergence.
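To make the E- and M-steps concrete, below is a minimal EM sketch in R for the gamma-mixture case. It is an illustration under our own assumptions, not the authors' implementation: the function em_gamma_mixture and its arguments are hypothetical, and because the gamma M-step has no closed form, the weighted log-likelihood is maximized numerically with optim():

# Minimal EM sketch for a K-component gamma mixture (illustrative only).
em_gamma_mixture <- function(x, shape, rate, pi, n_iter = 100, tol = 1e-6) {
  K <- length(pi)
  prev_nll <- Inf
  nll <- Inf
  for (iter in seq_len(n_iter)) {
    # E-step: responsibilities p(x|j)* of Eq. (3) from current parameters
    dens <- vapply(seq_len(K),
                   function(j) pi[j] * dgamma(x, shape = shape[j], rate = rate[j]),
                   numeric(length(x)))
    resp <- dens / rowSums(dens)
    # M-step: closed-form prior update; shape/rate by numerical optimization
    pi <- colMeans(resp)
    for (j in seq_len(K)) {
      fit <- optim(par = c(log(shape[j]), log(rate[j])),
                   fn = function(par) -sum(resp[, j] *
                          dgamma(x, shape = exp(par[1]), rate = exp(par[2]), log = TRUE)),
                   method = "BFGS")
      shape[j] <- exp(fit$par[1])
      rate[j]  <- exp(fit$par[2])
    }
    # Stop when the negative log-likelihood of Eq. (2) stabilizes
    nll <- -sum(log(rowSums(dens)))
    if (abs(prev_nll - nll) < tol) break
    prev_nll <- nll
  }
  list(pi = pi, shape = shape, rate = rate, nll = nll)
}

The same loop applies to the normal and lognormal mixtures by swapping dgamma() for dnorm() or dlnorm(); for those two, the M-step can instead use closed-form weighted moment updates.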