Biomedical Engineering Reference
In-Depth Information
the FDR at 10% would allow for 10% of declared differentially expressed genes
to be falsely identified.
11.6.5 Microarray Analysis
One of the major areas of focus among the bioinformatics community is the anal-
ysis of microarray (chip) data, where thousands of data points are generated in a
single experiment either for gene mRNA (oligonucleotide expression array as an
example) or DNA deletion studies (SNP chips as an example) [114]. Although a
thorough discussion of this topic is beyond the scope of this chapter, a few key
issues deserve mention. In general, there are two types of analytic
problems: (1) class comparison and (2) class prediction.
Class comparison involves comparing the high dimensional gene expression
profiles across groups or conditions (for example, high versus low grade tumors,
responsive versus resistant diseases, normal versus tumor cells, stroma versus ep-
ithelium cells). One approach is to compare genes one by one in order to identify
overexpressed mRNAs. Here, the problems of multiple comparisons that were dis-
cussed above for qRT-PCR are magnified since thousands of genes are compared
across groups rather than only dozens of genes. It is therefore essential to
limit false positive results with a multiple-comparisons procedure, such as one
that controls the false discovery rate (FDR). One approach
that is commonly used is to do all testing at a 0.001 significance level (i.e., a gene
is differentially expressed when the p-value < 0.001 using a t-test). A gene list is
then constructed from the genes with statistically significant differences between groups,
and a false discovery rate can be computed. An overall test of whether the pat-
terns in gene expression are different between groups can be constructed using a
permutation test. Specifically, we can scramble the class labels (group identifiers)
and redo the analysis many times (say, 5,000). The p-value for a test of whether
the gene expression profiles are different across groups can be computed as the
proportion of scrambled datasets in which the number of "significant" genes is at
least as large as the number of significant genes in the original data. Visually, these gene expression patterns can
be compared across groups by multidimensional scaling (MDS). MDS compresses
differences between sample expression profiles into three eigenvectors for plotting
in three-dimensional space.
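
The gene-by-gene testing and the label-scrambling test described above can be sketched as follows. This is a minimal illustration on simulated data, not the chapter's own code: the function names, the simulated expression matrix, the group sizes, and the use of NumPy and SciPy are all assumptions made for the example. The FDR estimate shown (expected false positives, alpha times the number of genes, divided by the length of the gene list) is one simple way to compute it, and the demo uses 500 scrambles rather than 5,000 only to keep it fast.

```python
import numpy as np
from scipy import stats


def count_significant(expr, labels, alpha=0.001):
    """Count genes whose two-sample t-test p-value is below alpha.

    expr   : genes x samples array of expression values
    labels : array of 0/1 class labels, one per sample
    """
    group_a = expr[:, labels == 0]
    group_b = expr[:, labels == 1]
    pvals = stats.ttest_ind(group_a, group_b, axis=1).pvalue
    return int(np.sum(pvals < alpha))


def permutation_pvalue(expr, labels, alpha=0.001, n_perm=5000, seed=0):
    """Global test: scramble the class labels and redo the analysis.

    Returns the proportion of scrambled datasets with at least as many
    'significant' genes as the original data.
    """
    rng = np.random.default_rng(seed)
    observed = count_significant(expr, labels, alpha)
    hits = 0
    for _ in range(n_perm):
        scrambled = rng.permutation(labels)  # shuffled copy of the labels
        if count_significant(expr, scrambled, alpha) >= observed:
            hits += 1
    return hits / n_perm


# Simulated data (hypothetical): 200 genes, 10 samples per group,
# with the first 20 genes shifted upward in group 1.
rng = np.random.default_rng(0)
n_genes, n_per_group, alpha = 200, 10, 0.001
labels = np.repeat([0, 1], n_per_group)
expr = rng.normal(size=(n_genes, 2 * n_per_group))
expr[:20, labels == 1] += 3.0

observed = count_significant(expr, labels, alpha)
# Simple FDR estimate for the gene list: expected false positives / list size.
est_fdr = min(1.0, alpha * n_genes / max(observed, 1))
p_global = permutation_pvalue(expr, labels, alpha, n_perm=500, seed=1)

print("significant genes:", observed)
print("estimated FDR:", est_fdr)
print("global permutation p-value:", p_global)
```

Counting a scrambled dataset as a hit when it ties the observed count keeps the test conservative.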
Class prediction involves developing a predictor of disease outcome from
high-dimensional microarray data. There are many methods for developing a
class predictor including discriminant analysis, logistic regression, neural network
methodology, and classification trees (see [114] for a comparison of approaches).
It is particularly important to emphasize that any predictive model needs to be
validated on a completely independent dataset. Validating a predictive model on
the same dataset on which the model was developed conceals over-fitting and gives
an overoptimistic assessment of the quality of the predictive model. One approach
that is commonly used is to split the dataset into a training set in which the predictive
model is developed and a test set in which the predictive model is validated.
This can be done by splitting the data in two and fitting the predictive models
in the first half of the data and evaluating the accuracy of the predictions in the
second half.
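
The split-half validation described above can be sketched as follows. This is an illustration under stated assumptions rather than a recommended pipeline: the nearest-centroid classifier is a simple stand-in for the methods listed above, and the function names, sample sizes, and pure-noise simulated data are hypothetical. Because the simulated expression values carry no real class signal, any accuracy measured on the training half reflects over-fitting only, and one would expect the held-out accuracy to be much closer to chance.

```python
import numpy as np


def fit_centroids(X, y):
    """Return the class labels and the mean expression profile of each class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(X, classes, centroids):
    """Assign each sample to the class with the nearest centroid."""
    # Squared Euclidean distance from every sample to every centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]


# Simulated data (hypothetical): 40 samples x 1000 genes of pure noise,
# with class labels that are unrelated to the expression values.
rng = np.random.default_rng(0)
n_samples, n_genes = 40, 1000
X = rng.normal(size=(n_samples, n_genes))
y = np.repeat([0, 1], n_samples // 2)

# Split the data in two at random: first half for training, second for testing.
order = rng.permutation(n_samples)
train_idx = order[: n_samples // 2]
test_idx = order[n_samples // 2:]

# Develop the predictor on the training half only.
classes, centroids = fit_centroids(X[train_idx], y[train_idx])

# Accuracy on the data used for fitting (overoptimistic) versus held-out data.
train_acc = float(np.mean(predict(X[train_idx], classes, centroids) == y[train_idx]))
test_acc = float(np.mean(predict(X[test_idx], classes, centroids) == y[test_idx]))

print("accuracy on training half:", train_acc)
print("accuracy on test half:", test_acc)
```

Any gene selection or tuning step would also need to be carried out within the training half alone, so that the test half stays untouched until the final evaluation.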