Knowledge Extraction from Microarray Datasets Using Combined Multiple Models to Predict Leukemia Types - Data Mining: Foundations and Practice

follows the comprehensibility of a single decision tree. Although many studies have demonstrated a higher level of accuracy in classifying cancer cells, for example [2, 3], the comprehensibility of decision trees has been ignored in the pursuit of the best accuracy in microarray data analysis [4-7].

In this study, we attempt to combine the high accuracy of ensembles with the interpretability of a single tree in order to derive exact rules that describe differences between significantly expressed genes that are responsible for leukemia. To achieve this, we apply the Combined Multiple Models (CMM) method, originally proposed by Domingos [8]. In our study the method is adapted for multidimensional, real-valued microarray datasets to eliminate the collinearity and multivariate problems. All datasets in our experiment are publicly available from the Kent Ridge Repository described in [20]. These microarray samples are examples of human tissue extracts that are related to a specific disease, and they have been used for comprehensible interpretation in this study. The following sections describe the datasets and the methods of CMM adaptation and testing. They also present the results obtained by applying the adapted method to four publicly available databases. Finally, the chapter presents a validation study, first by interpreting the results in the context of rule sets and then by comparing the proposed adaptations with combined and simple decision trees for leukemia grouping.

2 Combined Multiple Models for Gene Expression Analysis

Data mining is the process of autonomously extracting useful information or knowledge from large datasets. Many different models can be used in the data mining process. In many applications, however, it is not enough to have a model that produces accurate predictions; we also want a comprehensible model that can be easily interpreted by people who are not familiar with data mining. For example, Tibshirani and Knight [9] proposed a method called Bumping that tries to use bagging to produce a single classifier that best describes the decisions of the bagged ensemble. It builds models from bootstrapped samples and keeps the one with the lowest error rate on the original data. Typically this is enough to obtain good results on the test set as well. We should also mention papers that suggest different techniques for extracting decision trees from neural networks or ensembles of neural networks, which can all be seen as “black-box” methods [10-12].
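
The selection step of Bumping described above can be illustrated with a minimal sketch. It is not the implementation of [9]: the base learner is assumed to be a one-split decision stump, the function names (fit_stump, predict, error_rate, bump) are hypothetical, and the six-sample dataset with ALL/AML labels is invented purely for illustration.

```python
import random


def predict(stump, row):
    """Classify one sample with a stump (feature index, threshold, left label, right label)."""
    feature, threshold, left, right = stump
    return left if row[feature] <= threshold else right


def error_rate(stump, X, y):
    """Fraction of samples in (X, y) that the stump misclassifies."""
    wrong = sum(predict(stump, row) != label for row, label in zip(X, y))
    return wrong / len(y)


def fit_stump(X, y):
    """Exhaustively search for the stump with the lowest error on (X, y)."""
    labels = sorted(set(y))
    best_stump, best_err = None, None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for left in labels:
                for right in labels:
                    stump = (feature, threshold, left, right)
                    err = error_rate(stump, X, y)
                    if best_err is None or err < best_err:
                        best_stump, best_err = stump, err
    return best_stump


def bump(X, y, n_boot=25, seed=0):
    """Fit one model per bootstrap sample and keep the one with the
    lowest error rate on the ORIGINAL data."""
    rng = random.Random(seed)
    n = len(y)
    best_model, best_err = None, None
    for _ in range(n_boot):
        # Bootstrap sample: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        # Candidates are scored on the original data, not on the bootstrap sample.
        err = error_rate(model, X, y)
        if best_err is None or err < best_err:
            best_model, best_err = model, err
    return best_model, best_err


# Toy data (invented): two "expression" values per sample, two leukemia labels.
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 9.0], [4.0, 8.0], [5.0, 1.0], [6.0, 2.0]]
y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

model, err = bump(X, y)
print("selected stump:", model, "error on original data:", err)
```

The result is a single, directly readable model rather than a vote over many models, which is the property that makes this family of methods attractive when comprehensibility matters. In practice the stump would be replaced by a full decision tree learner.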
