A single tree-based method may not offer information on the influence of many of the factors on the behavior of the component. Ensemble tree-based algorithms are strong methods that overcome this limitation.
The survival trees method is highly popular among the tree-based methods. This method is useful for identifying factors that may influence a failure event and the mileage or time to an event of interest. Survival trees do not require any distributional assumptions and tend to be resistant to the influence of outliers. When a single tree framework is used, the data are split by only a subset of the factors and the rest are disregarded because of the tree's stopping conditions, e.g. the minimum number of observations in a terminal node. Therefore, a single tree-based method may not offer information on the influence of many of the factors on the behavior of the component. In order to overcome this limitation, a new ensemble approach is proposed.
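As a concrete illustration of the single-tree case, the sketch below grows one conditional inference survival tree with the ctree function from the party package. The data set (veteran, shipped with the survival package) and the stopping parameters are illustrative choices only, not values recommended by the text.

```r
# Sketch: a single conditional inference survival tree.
# The veteran data and the stopping parameters below are illustrative only.
library(party)      # ctree(), ctree_control()
library(survival)   # Surv(), veteran data

single_tree <- ctree(
  Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior,
  data     = veteran,
  controls = ctree_control(minsplit  = 40,   # do not split nodes with fewer observations
                           minbucket = 20))  # minimum observations in a terminal node

# Only the factors actually used for splitting appear in the fitted tree;
# the remaining covariates are disregarded once the stopping rules are met.
print(single_tree)
plot(single_tree)   # terminal nodes are displayed as Kaplan-Meier curves
```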
Random Survival Forests (RSF) is a method for the analysis of right-censored survival data. Both Random Forest (RF) and RSF are very efficient algorithms for analyzing large multidimensional datasets. However, due to their random nature they are not always intuitive and comprehensible to the user, and different trees in the forest might yield conflicting interpretations. In contrast to RF, the cforest function in R creates random forests from unbiased classification trees based on a conditional inference framework.
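A minimal sketch of both forest approaches is given below: rfsrc from the randomForestSRC package is a widely used implementation of RSF, and cforest from the party package fits the conditional inference forest mentioned above. The data set and the tuning values (ntree, mtry) are placeholders for illustration, not settings prescribed by the text.

```r
# Sketch: two forest approaches for right-censored data.
# Dataset and tuning values (ntree, mtry) are illustrative placeholders.
library(survival)        # Surv(), veteran data
library(randomForestSRC) # rfsrc(): Random Survival Forest
library(party)           # cforest(), cforest_unbiased(): conditional inference forest

set.seed(123)

# Random Survival Forest
rsf_fit <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 500)
print(rsf_fit)           # forest summary, including the out-of-bag error rate

# Conditional inference forest built from unbiased trees
cif_fit <- cforest(Surv(time, status) ~ trt + celltype + karno + diagtime + age,
                   data     = veteran,
                   controls = cforest_unbiased(ntree = 500, mtry = 3))
# Out-of-bag predictions (median survival times for a censored response)
head(predict(cif_fit, OOB = TRUE))
```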
As in classification trees, the splitting criterion in survival trees is crucial to the success of the algorithm. Bou-Hamad et al. (2011) provide a very detailed comparison of splitting criteria. Most of the existing algorithms use statistical tests for choosing the best split. One possible approach is to use the logrank statistic to compare the two groups formed by the child nodes. The chosen split is the one with the largest significant test statistic value. The use of the logrank test leads to a split which assures the best separation of the median survival times in the two child nodes. Another option is to use the likelihood ratio statistic (LRS) under an assumed model to measure the dissimilarity between the two child nodes. A further option is to use the Kolmogorov-Smirnov statistic to compare the survival curves of the two nodes. Some researchers suggest selecting the split based on residuals obtained from fitting a model: the degree of randomness of the residuals is quantified and the split that appears the least random is selected. The party package in R provides a set of tools for training survival trees. Section 10.3 presents a walk-through guide for building Regression Trees in R.
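To make the logrank splitting criterion described above concrete, the sketch below scores a single candidate split by hand: the observations are divided into the two groups that would form the child nodes and compared with the logrank test (survdiff in the survival package). The cut-point on the Karnofsky score is an arbitrary example, not one chosen by the text.

```r
# Sketch: scoring one candidate split with the logrank statistic.
# The cut-point (Karnofsky score >= 60) is arbitrary, chosen for illustration.
library(survival)   # Surv(), survdiff(), veteran data

vet <- veteran
vet$node <- ifelse(vet$karno >= 60, "left child", "right child")

lr <- survdiff(Surv(time, status) ~ node, data = vet)
lr$chisq                                        # logrank chi-square statistic
pchisq(lr$chisq, df = 1, lower.tail = FALSE)    # corresponding p-value

# A splitting algorithm would repeat this for every factor and every cut-point,
# then choose the split with the largest (significant) test statistic.
```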