arrays with the median GNUSE error scores [ 15 ] larger than a threshold value of 1.2
have been discarded, resulting in a total of 92 microarrays for analysis.
After applying the fRMA algorithm, we choose the “core” exons, that is, those for which we
have the best evidence of being a coding region, with unique hybridization, unique
localization on human chromosomes, and assigned genes, according to the NetAffx
probeset annotation v33.1. 4 This resulted in a total of 228476 features. Finally, the
2000 exon features with the largest standard deviation values have been chosen for
analysis.
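The final filtering step, keeping the features with the largest standard deviation, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the matrix `X`, the helper name `top_sd_features`, and the toy dimensions are hypothetical, and the real data would have 228476 columns with k = 2000.

```python
import numpy as np

def top_sd_features(X, k):
    """Return column indices of the k features with the largest standard deviation.

    X is a hypothetical (samples x features) matrix of exon-level values.
    """
    sd = X.std(axis=0, ddof=1)        # sample standard deviation per feature
    order = np.argsort(sd)[::-1]      # feature indices, largest SD first
    return np.sort(order[:k])         # report them in original column order

rng = np.random.default_rng(0)
# Toy stand-in for the real data: 92 arrays, 50 features with varying spread.
scales = np.linspace(0.1, 5.0, 50)
X = rng.normal(size=(92, 50)) * scales
selected = top_sd_features(X, k=10)
X_reduced = X[:, selected]
```

The choice of `ddof=1` is an assumption; it rescales every standard deviation by the same factor, so it does not change which features are selected.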
To summarize, our data set consists of n = 92 patients with p = 2000 features
corresponding to exons. Each patient is assigned one of five tumor stages (1, 2, 3, 4,
and 4s). We categorize the stages into low risk (y_i = −1, stages 1, 2, and
4s) and high risk (y_i = +1, stages 3 and 4), so as to create a binary classification task.
The ratio of the two categories is about 50:50.
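The stage grouping described above can be written as a small lookup table. The sketch below is only illustrative: the names `STAGE_TO_LABEL` and `stage_labels` and the example stage list are hypothetical, while the mapping itself follows the text (stages 1, 2, and 4s are low risk; stages 3 and 4 are high risk).

```python
# Stage-to-label mapping as described in the text:
# stages 1, 2, 4s -> low risk (-1); stages 3, 4 -> high risk (+1).
STAGE_TO_LABEL = {"1": -1, "2": -1, "4s": -1, "3": +1, "4": +1}

def stage_labels(stages):
    """Map an iterable of stage strings to a list of labels in {-1, +1}."""
    return [STAGE_TO_LABEL[str(s).strip().lower()] for s in stages]

# Hypothetical stage annotations for six patients.
example_stages = ["1", "4s", "3", "4", "2", "4S"]
y = stage_labels(example_stages)
```

An unknown stage raises a `KeyError`, which is usually preferable to silently assigning a label.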
In this data set, the 2000 exon features are grouped into K = 845 genes.
Figure 14.2 shows the sizes of groups, that is, the number of exons (y-axis) in each
gene (a few gene names are on the x-axis). About 88% of genes consist of 1-4 exons,
whereas the gene C8 has the maximal size (30 exons).
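Group sizes of this kind can be computed by counting how many exon features map to each gene. The following is a minimal sketch, assuming a list that gives one gene name per feature; the function name `group_size_summary` and the gene names in the example are made up for illustration.

```python
from collections import Counter

def group_size_summary(feature_genes):
    """Count exon features per gene and summarize the group sizes."""
    sizes = Counter(feature_genes)               # gene -> number of exon features
    n_groups = len(sizes)
    small = sum(1 for s in sizes.values() if s <= 4)
    largest_gene, largest_size = max(sizes.items(), key=lambda kv: kv[1])
    return {
        "n_groups": n_groups,
        "fraction_with_1_to_4": small / n_groups,
        "largest_gene": largest_gene,
        "largest_size": largest_size,
    }

# Hypothetical gene assignment for eight exon features.
feature_genes = ["G1", "G1", "G2", "G3", "G3", "G3", "G3", "G3"]
summary = group_size_summary(feature_genes)
```

Applied to the real annotation, `n_groups` would correspond to K and `fraction_with_1_to_4` to the share of genes with 1-4 exons.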
14.3.2 Algorithms for Comparison
The following three algorithms are compared, each using the loss function for logistic
regression, in order to identify features that are important for distinguishing the
high-risk and low-risk categories.
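For labels coded as −1 and +1, the logistic regression loss can be sketched as below. This is an illustrative form only: averaging over samples (rather than summing), the intercept `b`, and the function name `logistic_loss` are assumptions, and the penalty terms that distinguish the compared algorithms are not included.

```python
import numpy as np

def logistic_loss(w, b, X, y):
    """Average logistic loss for labels y in {-1, +1}.

    loss = mean(log(1 + exp(-y * (X @ w + b))))
    """
    margins = y * (X @ w + b)
    # logaddexp(0, -m) evaluates log(1 + exp(-m)) without overflow.
    return np.mean(np.logaddexp(0.0, -margins))

# Tiny hypothetical data set: four samples, two features.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
loss_at_zero = logistic_loss(np.zeros(2), 0.0, X, y)        # equals log(2)
loss_fit = logistic_loss(np.array([2.0, 2.0]), 0.0, X, y)   # smaller: all margins are 2
```

With all coefficients at zero every sample contributes log(2), which gives a convenient baseline when checking an implementation.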
Fig. 14.2 The number of exons in genes. As genes consist of exons, this shows the sizes of groups.
Most of the groups have 1-4 features
4 http://www.affymetrix.com/analysis/downloads/na33/
 
 