Applications of the SDL Global Optimization Method in DNA Microarray Data Analysis - DNA Microarray Technology and Data Analysis in Cancer Research


In the past, many clinicians were unable to provide a clear-cut classification of cancer patients based on the biopsy. With the system proposed here, however, surveying the expression of thousands of genes becomes practical. This chapter outlines a workable concept which, with further development, will bring new potential for accurate diagnosis. Its biggest advantage is that the global optimum is always found with little prior knowledge.

5.2.2. Datasets

Two popular microarray gene expression datasets, for colon cancer and

leukemia, were used in this study.

5.2.2.1. Colon data

The original gene expression data were downloaded from the Internet (http://dir.niehs.nih.gov/microarray/datamining/public_html/colon.html).

The matrix I2000 contains the expression of the 2000 genes with highest minimal intensity across the 62 tissues (Alon et al., 1999). The genes are placed in order of descending minimal intensity. Each entry in I2000 is a gene intensity derived from the ∼20 feature pairs that correspond to the gene on the chip. The data are otherwise unprocessed (for example, they have not been normalized by the mean intensity of each experiment). The

“name” file contains the expressed sequence tag (EST) number and

description of each of the 2000 genes, in an order that corresponds to the

order in I2000. The identity of the 62 tissues is given in the file “tissues

data”. The numbers correspond to patients, a positive sign to a normal tissue, and a negative sign to a tumor tissue. The data contain the expression levels of 2000 genes across the 62 samples, of which 40 are tumor tissues and 22 are normal tissues. Other researchers identified five tissue samples (Normal34, Normal36, Tumor30, Tumor33, and Tumor36) as likely to have been contaminated (Li et al., 2001a); to avoid this uncertainty, those five samples were removed from the colon cancer dataset. As in the previous study (Li et al., 2001b),

the remaining 57 samples were then divided into a training set (the first

40 samples) and a test set (17 samples). The numbers of tumor and normal
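The sample filtering and the 40/17 split described for the colon data can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code. The function name `clean_and_split` is hypothetical, and so is the mapping of "Normal34" to the signed entry +34 and of "Tumor30" to -30. The expression matrix and the patient numbering are synthetic stand-ins for the real I2000 and tissue files.

```python
import numpy as np

# Samples reported as possibly contaminated (Li et al., 2001a), written in the
# signed convention of the tissue file: positive = normal, negative = tumor.
# Assumption: "Normal34" corresponds to the entry +34, "Tumor30" to -30, etc.
EXCLUDED = (34, 36, -30, -33, -36)


def clean_and_split(expr, tissue_codes, excluded=EXCLUDED, n_train=40):
    """Drop excluded samples, then split by position into train and test.

    expr         : array of shape (n_genes, n_samples)
    tissue_codes : signed patient numbers, one per sample (column of expr)
    Returns ((train_x, train_is_tumor), (test_x, test_is_tumor)).
    """
    expr = np.asarray(expr, dtype=float)
    codes = np.asarray(tissue_codes, dtype=int)
    if expr.shape[1] != codes.size:
        raise ValueError("need exactly one tissue code per sample column")

    keep = ~np.isin(codes, excluded)      # True for samples that are retained
    expr, codes = expr[:, keep], codes[keep]
    is_tumor = codes < 0                  # negative sign marks a tumor tissue

    train = (expr[:, :n_train], is_tumor[:n_train])
    test = (expr[:, n_train:], is_tumor[n_train:])
    return train, test


# Synthetic stand-in data: 2000 genes, 40 tumor and 22 normal samples.
# The patient numbers below are invented so that the five excluded codes exist.
rng = np.random.default_rng(0)
codes = [-p for p in range(1, 41)] + list(range(19, 41))
expr = rng.lognormal(size=(2000, 62))

(train_x, train_y), (test_x, test_y) = clean_and_split(expr, codes)
print(train_x.shape, test_x.shape)  # (2000, 40) (2000, 17)
```

Because the split is positional, the class balance of the two sets depends entirely on the order of the samples in the file. The synthetic ordering used here says nothing about the balance in the real dataset.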
