Biomedical Engineering Reference
methods have been used in the literature to provide some degree of validation
for the selected genes. In one method, which could be called validation by
classification, the selected genes are validated if they perform adequately in
predicting the type of tissue (case or control) for samples that were excluded
from the training set but whose class is known. In a second approach, which
could be called validation by statistical significance, genes are chosen if they
behave sufficiently differently from what would be expected if there were no
class distinction between the case and control samples (the null hypothesis).
We will discuss other possibilities for validation later on. Before describing
the recent literature on gene selection, however, a few nomenclature
conventions are necessary.
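As a rough illustration of validation by statistical significance, the sketch below builds the null hypothesis empirically by shuffling the class labels, which destroys any real class distinction. The score (absolute difference of class means), the function names, the number of permutations, and the expression values are all assumptions made for this example; they are not taken from the literature discussed in this section.

```python
import random
from statistics import mean

def mean_difference(values, labels):
    """Absolute difference between the class-1 and class-2 sample means."""
    class1 = [v for v, c in zip(values, labels) if c == 1]
    class2 = [v for v, c in zip(values, labels) if c == 2]
    return abs(mean(class1) - mean(class2))

def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Estimate how often shuffled labels score at least as high as the real ones."""
    rng = random.Random(seed)
    observed = mean_difference(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # simulate "no class distinction"
        if mean_difference(values, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical expression values for two genes in 5 cases and 5 controls.
labels = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
separated_gene = [5.1, 4.8, 5.3, 5.0, 4.9, 2.1, 2.4, 1.9, 2.2, 2.0]
flat_gene = [3.0, 3.4, 2.9, 3.1, 3.3, 3.2, 3.0, 3.3, 2.8, 3.1]

p_separated = permutation_p_value(separated_gene, labels)
p_flat = permutation_p_value(flat_gene, labels)
```

A gene whose values separate the two classes should receive a small estimated p-value, while a gene with no class structure should not. In a real assay with many gene probes, some correction for multiple testing would also be needed.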
2.1. Nomenclature
Throughout this section it will be assumed that we are dealing with an assay
in which $M$ samples were hybridized to their respective $M$ arrays, each
containing $N$ gene probes. Of the $M$ samples, $M_1$ samples are of class 1
($C_1$) and $M_2$ samples are of class 2 ($C_2$), where by class 1 and 2 we
mean cancer and control, or cancer of type 1 and cancer of type 2, etc. The
value of the expression measured for the $i$th gene in the $k$th sample of
class $c$ will be denoted by $X_{ik}^{(c)}$. In many algorithms the data are
preprocessed by different normalizations and transformations. In these cases
we shall still denote by $X_{ik}^{(c)}$ the resulting gene expression values
after the preprocessing steps. (For a review of normalization considerations
see (17).) The sample mean and standard deviation of gene $i$ in class $c$
will be denoted by $\mu_i^{(c)}$ and $\sigma_i^{(c)}$, respectively.
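To make the notation concrete, here is a minimal sketch of how such an assay might be laid out and how the per-class sample size, mean, and standard deviation of a gene could be computed. The variable names, the toy values, and the use of the (n - 1)-denominator standard deviation are assumptions for illustration; the text does not specify which estimator is meant.

```python
from statistics import mean, stdev

# Hypothetical toy assay: N = 2 gene probes (rows), M = 6 samples (columns).
# expression[i][k] is the (preprocessed) expression of gene i in sample k.
expression = [
    [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    [4.0, 4.0, 5.0, 4.0, 5.0, 4.0],
]
# classes[k] is the class (1 or 2) of sample k, so M_1 = M_2 = 3 here.
classes = [1, 1, 1, 2, 2, 2]

def class_values(gene_row, classes, c):
    """Expression values of one gene restricted to the samples of class c."""
    return [x for x, label in zip(gene_row, classes) if label == c]

def class_summary(gene_row, classes, c):
    """Return (M_c, sample mean, sample standard deviation) for one gene.

    Assumption: stdev() uses the n - 1 denominator.
    """
    values = class_values(gene_row, classes, c)
    return len(values), mean(values), stdev(values)

m1, mean1, sd1 = class_summary(expression[0], classes, 1)
m2, mean2, sd2 = class_summary(expression[0], classes, 2)
```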
2.2. Selecting Genes One at a Time: Univariate Methods
2.2.1. t-Score-Based Statistics
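One of the most common univariate analyses uses the t-statistic (or t-score),
which for gene $i$ can be written in terms of the class sample means
$\mu_i^{(c)}$ and standard deviations $\sigma_i^{(c)}$ as:

$$ t_i = \frac{\mu_i^{(1)} - \mu_i^{(2)}}{\sqrt{\left(\sigma_i^{(1)}\right)^2 / M_1 + \left(\sigma_i^{(2)}\right)^2 / M_2}}. \qquad [1] $$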
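The t-score of a single gene can be computed directly from the two groups of expression values. The following is a minimal sketch using only the Python standard library; the function name and the toy values are hypothetical, and the sample variances are assumed to use the n - 1 denominator.

```python
from math import sqrt
from statistics import mean, variance

def t_score(class1_values, class2_values):
    """Difference of class means in units of its estimated standard deviation."""
    m1, m2 = len(class1_values), len(class2_values)
    difference = mean(class1_values) - mean(class2_values)
    # Estimated variance of the difference between the two sample means.
    # Assumption: variance() uses the n - 1 denominator.
    variance_of_difference = (
        variance(class1_values) / m1 + variance(class2_values) / m2
    )
    return difference / sqrt(variance_of_difference)

# Hypothetical expression values of one gene in 3 cases and 3 controls.
gene_class1 = [5.0, 6.0, 7.0]
gene_class2 = [1.0, 2.0, 3.0]
t = t_score(gene_class1, gene_class2)
```

Note that this sketch does not guard against a gene whose values are constant within both classes, in which case the denominator is zero.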
This statistic measures the difference between the sample means in cases
and controls in units of the standard deviation of this difference. If the two
samples are normally distributed, or if $M_1$ and $M_2$ are large, the
theoretical distribution of the t-score is known. In the former case $t_i$
would be distributed according to the t-distribution, and in the latter case
the distribution of $t_i$