The process of selecting a new attribute and partitioning the samples is repeated for each node until all attributes have been used or until the samples associated with a specific node all have the same value for the target attribute (i.e. their entropy is 0). The ID3 and C4.5 algorithms (and some of their improved versions) use information-based splitting criteria. The gain ratio, used by C4.5, is a modification of the information gain that takes into account the intrinsic information of a split, that is, it takes the number and size of branches into account when choosing an attribute (Abraham et al., 2009).
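As a rough illustration of these criteria, the Python sketch below computes entropy, information gain, and gain ratio for a hypothetical two-class node that is split into two branches; the class labels and the split itself are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, branches):
    """Reduction in entropy achieved by splitting `parent` into `branches`."""
    n = len(parent)
    remainder = sum(len(b) / n * entropy(b) for b in branches)
    return entropy(parent) - remainder

def gain_ratio(parent, branches):
    """Information gain normalised by the intrinsic information of the split,
    which penalises attributes that create many small branches."""
    n = len(parent)
    intrinsic = -sum((len(b) / n) * math.log2(len(b) / n) for b in branches)
    return information_gain(parent, branches) / intrinsic if intrinsic else 0.0

# Hypothetical node of 10 samples split into two branches by some attribute.
parent = ["active"] * 5 + ["inactive"] * 5
branches = [["active"] * 4 + ["inactive"], ["active"] + ["inactive"] * 4]
print(round(information_gain(parent, branches), 3))  # ~0.278 bits
print(round(gain_ratio(parent, branches), 3))        # ~0.278 (intrinsic info of this split is 1 bit)
```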
The Gini index (Gini coefficient, Gini impurity) is used to identify the attribute, and the cut point for that attribute, that minimize the variance or diversity in each of the two subsets that emerge from the split (Breiman et al., 1984):
\[ \mathrm{Gini} = 1 - \sum_{i=1}^{c} p_i^{2} \]   [5.20]

where p_i is the probability that a sample belongs to class i.
The Gini index always lies between 0 and 1, regardless of the number of classes, c, in the data set. When all samples belong to the same class, the Gini index has a value of 0, whereas it reaches its maximum value of 1 − 1/c when a sample is equally likely to belong to any of the classes. The CART algorithm uses the Gini index as its splitting criterion.
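A minimal Python sketch of this measure is given below; the class labels are hypothetical, and the helper `gini_of_split` merely illustrates how a CART-style procedure would compare candidate splits by their size-weighted impurity.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a node: 1 minus the sum of squared class probabilities (Eq. 5.20)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_of_split(groups):
    """Size-weighted Gini impurity of the subsets produced by a candidate split;
    the split with the lowest value is preferred."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * gini_index(g) for g in groups)

print(gini_index(["a"] * 10))                 # 0.0 -> pure node
print(gini_index(["a"] * 5 + ["b"] * 5))      # 0.5 -> maximum (1 - 1/c) for c = 2 classes
print(gini_of_split([["a"] * 5, ["b"] * 5]))  # 0.0 -> a perfect split
```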
In the case of simultaneous multiple-class classification, random forests can be developed, in which each tree in the forest is grown from a bootstrapped sample of the data set (a sketch is given below). GAs have been used to avoid locally optimal decisions and to search the decision tree space with little a priori bias (Papagelis and Kalles, 2001).
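The sketch below illustrates the bootstrapping idea with scikit-learn's RandomForestClassifier, assuming that library is available; the synthetic data set and all parameter values are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data set standing in for a real descriptor matrix.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Each of the 100 trees is grown on a bootstrapped sample of the training data,
# and a random subset of attributes is considered at every split.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))  # class predictions by majority vote of the trees
```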
The building of a decision tree is a supervised process, which means that all data attributes and classes are defined before tree construction begins. Also, each sample of the data set must belong to a certain class, and there should be more samples than classes (in order to avoid classes containing only one sample). A data set with a large number of samples is recommended for tree construction. The parameters that the user needs to supply in order to build a decision tree depend on the software used. Most programs allow automatic selection of predefined parameters, but users are sometimes encouraged to experiment with different algorithms, splitting criteria, pruning parameters, minimum numbers of samples per class, thresholds, etc., as illustrated in the sketch below.
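By way of illustration, the sketch below shows how such parameters might be set with scikit-learn's DecisionTreeClassifier, assuming that implementation is used; the values shown are arbitrary examples rather than recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="gini",      # splitting criterion ("entropy" selects an information-based split)
    max_depth=5,           # limits tree depth (a simple form of pre-pruning)
    min_samples_leaf=5,    # minimum number of samples allowed in a leaf
    ccp_alpha=0.01,        # cost-complexity (post-)pruning threshold
    random_state=0,
)
```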
Also, once the decision tree is built, it is important to evaluate its predictive ability. This is usually done with a cross-validation approach.
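A minimal sketch of such an evaluation, assuming scikit-learn and using its built-in iris data purely as a stand-in for a real compound data set, is:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation; each fold's score is the fraction of correctly
# classified samples, i.e. the accuracy defined below.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=10, scoring="accuracy")
print(scores.mean())
```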
Accuracy is the overall classification criterion of a prediction model; it corresponds to the ratio of correctly classified compounds to the total number of compounds (Choi et al., 2009). Since most decision trees are used for