or ambiguous (M). Before clustering the array data, we filtered out unreliable
measurements. In particular, we retained all genes for which all the time points
were present (4105 genes), all the genes for which more than 50% of the time
points were present, and all the genes for which the present/absent calls exhibited
a biologically relevant pattern (e.g. PAAA for the four time points in the
experiment, suggesting repression of gene expression over the course of the
experiment). In all, we retained 5652 genes. The expression patterns for these
genes were then z-normalized over each gene.
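The filtering and normalization steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the call-string encoding ("P" for present, "A" for absent, one character per time point) and the `relevant_patterns` parameter are assumptions, and only the PAAA pattern comes from the text. Note that "all time points present" is subsumed by the "more than 50% present" criterion.

```python
import numpy as np

def keep_gene(calls, relevant_patterns=("PAAA",)):
    """Decide whether a gene survives filtering.

    calls: string of present/absent calls, e.g. "PPAP" (assumed encoding).
    relevant_patterns: hypothetical whitelist of biologically relevant
    call patterns; only "PAAA" is mentioned in the text.
    """
    calls = calls.upper()
    frac_present = calls.count("P") / len(calls)
    # More than 50% present (covers the all-present case), or a whitelisted pattern.
    return frac_present > 0.5 or calls in relevant_patterns

def z_normalize_rows(expr):
    """Z-normalize each gene (row): subtract the row mean, divide by the row std."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    # Flat profiles have std == 0; dividing by 1 leaves them at zero after centering.
    std = np.where(std == 0, 1.0, std)
    return (expr - mean) / std

profiles = np.array([[1.0, 2.0, 3.0, 4.0],
                     [10.0, 10.0, 10.0, 10.0]])
normalized = z_normalize_rows(profiles)
```

After normalization each non-flat gene profile has zero mean and unit standard deviation, so profiles are compared by shape rather than by absolute expression level.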
16.2.2. Theoretical and Computational Framework
16.2.2.1. Notation
We denote the measure of distance for gene i, i = 1, ..., n, in feature (or
dimension) k, k = 1, ..., s, as a_ik. Each gene's 32-time-point expression pattern
is transformed into a 24-dimensional vector, in which each element a_ik gives
the change in normalized expression level between time points for that gene.
Each gene is to be assigned to only one (hard clustering) of c possible clusters,
each with center z_jk, j = 1, ..., c. The binary variable w_ij indicates whether
gene i falls within cluster j (w_ij = 1 if yes; w_ij = 0 if no). We then pre-cluster the
data to reduce the computational resources required to solve the hard clustering
problem by (i) identifying genes with similar experimental responses, and (ii)
removing outliers deemed not to be significant to the clustering process. To provide
just enough discriminatory power for the genes to be pre-clustered properly, we
reduce the expression vectors to a set of representative variables [+, o, -].
The (+) variable represents an increase in expression level compared to
the previous time point, the (-) variable represents a decrease in expression level
from the previous time point, and the (o) variable represents an expression level
that does not vary significantly (±10% of change across the time points). We
could have used other comparative metrics such as distance or correlation to
pre-cluster the genes, though at this first-pass stage, using the representative
variables [+, o, -] is simpler and produces pre-clusters of similar quality. The
pre-clustering process of choice can of course differ across datasets to be
clustered, and we choose the approach most expeditious for our data of interest.
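The [+, o, -] reduction can be illustrated with a short sketch. It rests on assumptions not stated in the text: the ±10% band is read as relative change with respect to the previous time point, the values are nonzero, and a single time series is shown per gene rather than the full set of experiments. The names `signature` and `pre_cluster` are hypothetical.

```python
from collections import defaultdict
import numpy as np

def signature(profile, tol=0.10):
    """Map a time series to a string over {+, o, -}, one symbol per consecutive pair.

    Assumption: a change within +/- tol (relative to the previous point)
    counts as 'o'; values are assumed to be nonzero.
    """
    profile = np.asarray(profile, dtype=float)
    prev, cur = profile[:-1], profile[1:]
    rel = (cur - prev) / np.abs(prev)
    return "".join("+" if r > tol else "-" if r < -tol else "o" for r in rel)

def pre_cluster(profiles, tol=0.10):
    """Group gene ids whose profiles share the same [+, o, -] signature."""
    groups = defaultdict(list)
    for gene_id, profile in profiles.items():
        groups[signature(profile, tol)].append(gene_id)
    return dict(groups)

example = {
    "g1": [1.0, 2.0, 2.05, 1.0],
    "g2": [2.0, 3.0, 3.1, 2.0],
    "g3": [5.0, 4.0, 4.0, 6.0],
}
groups = pre_cluster(example)
```

Here g1 and g2 rise, plateau and fall, so they share a signature and land in the same pre-cluster, while g3 follows the opposite pattern. Signatures held by very few genes could then be treated as outlier candidates, although the text does not specify the cutoff used.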
16.2.2.2. Hard Clustering by Global Optimization
The global optimization approach seeks to minimize the Euclidean distances
between the data points and the centers of their assigned clusters as:
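One sum-of-squares form consistent with the notation above is the sum over i, j and k of w_ij (a_ik - z_jk)^2, subject to each gene belonging to exactly one cluster (the sum over j of w_ij equals 1 for every i). The sketch below evaluates that assumed objective for a given assignment; it only scores a candidate solution and does not perform the global optimization itself. The function name and the use of squared distances are assumptions.

```python
import numpy as np

def clustering_objective(a, z, w):
    """Sum of squared distances between genes and their assigned centers.

    a: (n, s) array of gene feature vectors a_ik
    z: (c, s) array of cluster centers z_jk
    w: (n, c) binary assignment matrix w_ij, exactly one 1 per row
    """
    a = np.asarray(a, dtype=float)
    z = np.asarray(z, dtype=float)
    w = np.asarray(w, dtype=int)
    # Hard clustering: every entry is 0 or 1 and every gene has one cluster.
    if not (np.isin(w, (0, 1)).all() and (w.sum(axis=1) == 1).all()):
        raise ValueError("each gene must be assigned to exactly one cluster")
    # Squared distance from every gene to every center, shape (n, c).
    sq_dist = ((a[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)
    # Only the distances to assigned centers contribute.
    return float((w * sq_dist).sum())

a = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
z = [[0.0, 1.0], [10.0, 10.0]]
w = [[1, 0], [1, 0], [0, 1]]
value = clustering_objective(a, z, w)
```

In this toy case the first two genes each sit at squared distance 1 from their center and the third coincides with its center, so the objective is 2.0.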