Databases Reference
In-Depth Information
follows the comprehensibility of a single decision tree. Although, there are
many research that have demonstrated a higher level of accuracy in classifying
cancer cells, for example [2, 3], the comprehensibility issue of decision trees
to gain best accuracy in the domain of microarray data analysis has been
ignored [4-7].
In this study, we attempt to combine the high accuracy of ensembles and
the interpretability of the single tree in order to derive exact rules that de-
scribe differences between significantly expressed genes that are responsible
for leukemia. To achieve this, Combined Multiple Models (CMM) method
has been applied, which was proposed originally by Domingos in [8]. In our
study the method is adapted for multidimensional and real valued microarray
datasets to eliminate the colinearity and multivariate problems. All datasets
from our experiment are publicly available from the Kent Ridge Repository
described in [20]. These microarray samples are the examples of human tis-
sue extracts that are related to a specific disease and have been used for
comprehensible interpretation in this study. The following sections explore
the datasets, methods of CMM adaptation and testing. It also presents the
results that are obtained by applying the adapted method on four publicly
available databases. Finally the chapter presents a validation study by pro-
viding an interpretation of the results in the context of rule sets and then by
comparing the proposed adaptations with the combined and simple decision
trees for leukemia grouping.
2 Combined Multiple Models for Gene Expression
Analysis
Data mining is the process of autonomously extracting useful information or
knowledge from large datasets. Many different models can be used in data min-
ing process. However, it is required for many applications not only to involve
model that produce accurate predictions, but also to incorporate comprehen-
sible model. In many applications it is not enough to have accurate model,
but we also want comprehensible model that can be easily interpreted to the
people not familiar with data mining. For example, Tibshirani and Knight [9]
proposed a method called Bumping that tries to use bagging and produce a
single classifier that best describes the decisions of bagged ensemble. It builds
models from bootstrapped samples and keeps the one with the lowest error
rate on the original data. Typically this is enough to get good results also on
test set. We should also mention papers that suggest different techniques of
extracting decision trees from neural networks or ensembles of neural networks
that can all be seen as a “black-box” method [10-12].
Search WWH ::




Custom Search