In previous years, many clinicians were unable to classify cancer
patients clearly on the basis of a biopsy alone. With the system proposed
here, however, surveying the expression of thousands of genes becomes
practical. This chapter outlines a workable concept which, with further
development, could bring groundbreaking new potential for accurate
diagnosis. Its biggest advantage is that the global optimum is always
found with little prior knowledge.
5.2.2. Datasets
Two popular microarray gene expression datasets, for colon cancer and
leukemia, were used in this study.
5.2.2.1. Colon data
The original gene expression data were downloaded from the Internet
(http://dir.niehs.nih.gov/microarray/datamining/public_html/colon.html).
The matrix I2000 contains the expression of the 2000 genes with highest
minimal intensity across the 62 tissues (Alon et al., 1999). The genes are
placed in order of descending minimal intensity. Each entry in I2000 is a
gene intensity derived from the 20 feature pairs that correspond to the
gene on the chip. The data are otherwise unprocessed (for example, they have
not been normalized by the mean intensity of each experiment). The
“name” file contains the expressed sequence tag (EST) number and
description of each of the 2000 genes, in an order that corresponds to the
order in I2000. The identity of the 62 tissues is given in the file “tissues
data”. The numbers correspond to patients; a positive sign denotes a normal
tissue and a negative sign a tumor tissue. The data contain the expression
levels of 2000 genes across the 62 samples, of which 40 are tumor
tissues and 22 are normal tissues. Other researchers identified five tissue
samples (Normal34, Normal36, Tumor30, Tumor33, and Tumor36) as likely to
have been contaminated (Li et al., 2001a); to avoid this uncertainty, those
five samples were removed from the colon cancer dataset. As in the previous
study (Li et al., 2001b), the remaining 57 samples were then divided into a
training set (the first 40 samples) and a test set (the remaining 17 samples).
The numbers of tumor and normal