Biomedical Engineering Reference
Affymetrix GeneChip® HUGeneFL array, which predates the Hu95Av2 Affymetrix
GeneChip® array used for the collection of the CU data. This last point is
more than an anecdotal remark. The fact that different GeneChip® arrays were
used in the comparison (differing in gene probes, location of probes on the chip,
number of probes, etc.) and yet highly similar behavior was obtained in the CU
data using the genes selected from the WI data indicates that the composite
of genes selected using different methods is biologically relevant, in that it
contains reproducible information about the DLBCL and FL phenotypes. The
selected genes are not a statistical artifact.
The qualitative validation of the genes selected in the WI data and tested in
the CU data, shown in Figure 2, needs to be complemented by a quantitative
assessment of the likelihood that this similarity occurs by chance. We can
estimate the statistical significance of the comparison by counting the number of
consistent genes across the two data sets. We say that a gene is consistent if the
sign of the difference between its average expression in DLBCL and in FL is the
same in both data sets. A p-value for the number of consistent genes can be
estimated as the probability that at least as many genes as were observed to be
consistent in the two data sets would be found consistent if the genes were
chosen at random.
This p-value can be calculated from a binomial distribution with a probability
parameter equal to 0.5 (i.e., the null hypothesis assumes that in the CU data set
each gene has equal probability of being overexpressed or underexpressed in
DLBCL vs. FL). The p-value of the consistency for the selected features
across the two data sets can be computed independently for the genes upregulated
in DLBCL vs. FL, for those downregulated, and for the full set of genes. The
results are shown in Table 1, where we can see that the resulting p-values are
extremely small. Notice that this validation is neither a validation by statistical
significance within the same data set nor a validation by classification. It
lies between those two validation schemes: a test data set is
assessed for consistency with a training data set. From Table 1 we conclude that
the gene expression profile identified by the three methods is highly reproducible
in an independent data set. Indeed, out of the 210 selected genes, 184 (88%)
showed consistent behavior in the CU data set. Such a high level of agreement
is extremely unlikely to arise purely by chance, and confirms the
informative nature of the genes selected by our gene selection methods.
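
The consistency test described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names is_consistent and binomial_tail_pvalue are hypothetical, and the only figures taken from the text are the 184 consistent genes out of 210 and the null probability of 0.5. The per-direction counts of Table 1 are not reproduced here.

```python
from math import comb
from statistics import mean


def is_consistent(train_a, train_b, test_a, test_b):
    """Return True if the sign of mean(A) - mean(B) agrees in both data sets.

    train_* and test_* are lists of expression values for one gene in
    phenotype A (e.g. DLBCL) and phenotype B (e.g. FL). Hypothetical helper.
    """
    d_train = mean(train_a) - mean(train_b)
    d_test = mean(test_a) - mean(test_b)
    return d_train * d_test > 0


def binomial_tail_pvalue(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided consistency p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Full gene set from the text: 184 of 210 selected genes were consistent.
p_all = binomial_tail_pvalue(184, 210)
print(f"p-value for 184/210 consistent genes: {p_all:.3e}")
```

The same function could be applied separately to the upregulated and downregulated gene subsets, given their own counts of consistent and total genes.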
4. GENE EXPRESSION ARRAYS CAN BE USED FOR
DIAGNOSTICS: A CASE STUDY
Both combining gene selection methods and validating the selected genes
across laboratories will likely become common practice in the future.
This is bound to be the case because of the
existence of more available data sets on the same types of tissues in the public