Biomedical Engineering Reference
controls is conspicuous. However, the fact that these markers clearly separate
CLL patients from normal subjects in the Mayo data is expected given that these
genes were specifically selected to separate cases from controls in this data set.
A much more stringent verification of whether these markers express differen-
tially in CLL versus control would be to establish an equivalent differential ex-
pression in a completely independent data set. The study of Klein et al. (52)
(performed in R. Dalla-Favera's laboratory at Columbia University) provides
such an independent data set, the Columbia University (CU) data. The colored
matrix to the right of the yellow line in Figure 3 clearly shows that the 81 probes
found to differentially express in the analyses of the Mayo data have the same
qualitative behavior in the CU data. The fact that the majority of the genes that
underexpress in CLL versus control (as well as the genes that overexpress in
CLL versus control) in the Mayo data also do so in the CU data is a strong
indication that these genes are informative in the context of CLL. This result
not only validates biologically that the genes selected by the gene selection
algorithms are truly differentially expressed in CLL with respect to control, but
also supports the reproducibility of the DNA array technology. As
stressed in the previous section, this kind of validation, which we called valida-
tion by consistency, is likely to become much more widespread as more gene
expression data produced in different laboratories pertaining to the same
case/control studies become available.
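
The idea of validation by consistency can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: it assumes that the direction of change of a probe is judged by the sign of the difference in mean expression between cases and controls, and the probe names and expression values are hypothetical toy numbers, not taken from the Mayo or CU data.

```python
from statistics import mean


def direction(cases, controls):
    """Return +1 if the probe overexpresses in cases, -1 if it underexpresses, 0 if equal."""
    diff = mean(cases) - mean(controls)
    return (diff > 0) - (diff < 0)


def consistency(discovery, independent):
    """Fraction of shared probes whose direction of change agrees in both data sets.

    Each argument maps a probe name to a (case_values, control_values) pair.
    """
    shared = sorted(set(discovery) & set(independent))
    if not shared:
        raise ValueError("the two data sets have no probes in common")
    agree = sum(
        direction(*discovery[p]) == direction(*independent[p]) for p in shared
    )
    return agree / len(shared)


# Hypothetical log-scale expression values, for illustration only.
mayo = {
    "probe_A": ([8.1, 7.9, 8.4], [5.0, 5.2, 4.9]),  # over in cases
    "probe_B": ([3.0, 3.2, 2.9], [6.1, 6.0, 6.3]),  # under in cases
    "probe_C": ([7.0, 7.2, 6.9], [5.5, 5.4, 5.8]),  # over in cases
}
cu = {
    "probe_A": ([7.5, 7.8], [5.1, 5.3]),  # over in cases: agrees
    "probe_B": ([3.5, 3.1], [5.9, 6.2]),  # under in cases: agrees
    "probe_C": ([5.0, 5.1], [5.6, 5.9]),  # under in cases: disagrees
}

print(f"{consistency(mayo, cu):.2f} of the probes behave consistently")
```

In this toy example two of the three probes change in the same direction in both data sets. A real analysis would also need to account for platform differences and for statistical significance, which this sketch ignores.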
The previous discussion naturally leads to the question of whether the gene
expression values of these 81 probes can be used to create a diagnostic method
to determine whether or not a subject is affected by CLL. The idea in this case is
to create a decision rule based on the gene expression values of these 81 probes.
The flourishing field of machine learning (54,55) provides a number of tech-
niques to determine decision rules. Many of these learning techniques have been
applied to gene expression research. Among them we can mention nearest-
neighbor classifiers (e.g. (14)), neural networks (e.g. (56)), and support vector
machines (e.g. (57)). The last of these has proved to be a very powerful method for
separating two classes. In addition, it has an intuitive geometrical interpretation.
We shall use a support vector machine classifier to show that the disease
state of the subjects in the CU data can be perfectly predicted by a classifier
trained on the Mayo data. Let us briefly explain how support vector machines operate. Figure
4a exemplifies a two-dimensional space, and in it we have two classes of points:
the cases (solid circles) and controls (open circles). In its simplest conception, a
support vector machine will attempt to compute a line (a hyperplane in higher
dimensions) that perfectly separates
the cases from the controls, and whose distance to the closest point (or points) in
each class is maximal (the optimal hyperplane). If such a hyperplane exists, the
problem is said to be linearly separable, but this need not be the case in general.
Some additional constraints are necessary if the problem is not linearly separa-
ble (58), but we shall not discuss them in this chapter. Once the optimal hyper-
plane has been found in the training set (in our case, the Mayo data), then we
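
The geometric picture described above can be made concrete with a small Python sketch. It is only an approximation under stated assumptions: instead of solving the exact quadratic program for the optimal hyperplane, it minimizes a regularized hinge loss by subgradient descent, which for linearly separable data and a small regularization constant should approach the maximum-margin line. The function names, the parameter values (lam, lr, epochs), and the two-dimensional points are all hypothetical and stand in for a training set (the role of the Mayo data) and an independent test set (the role of the CU data).

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(points, labels, lam=0.01, lr=0.01, epochs=5000):
    """Fit w and b by full-batch subgradient descent on the regularized hinge loss.

    Labels must be +1 (case) or -1 (control).
    """
    dim = len(points[0])
    n = len(points)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of (lam / 2) * ||w||^2
        grad_b = 0.0
        for x, y in zip(points, labels):
            if y * (dot(w, x) + b) < 1:  # point is inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (case) or -1 (control) according to the side of the hyperplane."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical training set: cases are labeled +1, controls -1.
train_points = [(2, 2), (3, 3), (3, 2), (0, 0), (-1, 0), (0, -1)]
train_labels = [1, 1, 1, -1, -1, -1]

# Hypothetical independent test set.
test_points = [(4, 3), (2.5, 3), (-1, -1), (0.5, 0)]
test_labels = [1, 1, -1, -1]

w, b = train_linear_svm(train_points, train_labels)
predictions = [predict(w, b, x) for x in test_points]
print("w =", w, "b =", b)
print("test predictions:", predictions)
```

For this toy set the closest points of the two classes are (2, 2) and (0, 0), so the fitted line should lie roughly midway between them and perpendicular to the segment joining them. The decision rule is learned from the training points only and is then applied unchanged to the independent points.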