5.3.2 Theory
Decision trees are nonparametric statistical techniques used for
classification problems (Breiman et al., 1984). They are generated by
recursively partitioning the data set, using a splitting attribute, until all
the records in a partition belong to the same class (Chandra and
Varghese, 2009). Recursive partitioning is a method for determining
statistically meaningful rules that classify objects into similar categories
(Rusinko et al., 1999).
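To make this concrete, the following is a minimal Python sketch of recursive partitioning. The record layout (plain dicts carrying a "class" label) and the naive first-attribute split are illustrative assumptions only; the algorithms described below replace that naive choice with a statistical splitting criterion.

    from collections import Counter

    def partition(records, attributes):
        """Recursively split records until each partition is pure (one class)."""
        labels = [r["class"] for r in records]
        if len(set(labels)) == 1 or not attributes:
            # Leaf: the partition is pure, or no attributes remain
            # (fall back to the majority class).
            return Counter(labels).most_common(1)[0][0]
        attr, rest = attributes[0], attributes[1:]  # naive split choice
        subsets = {}
        for r in records:
            subsets.setdefault(r[attr], []).append(r)
        return {attr: {value: partition(subset, rest)
                       for value, subset in subsets.items()}}

For example, partition(records, ["outlook", "windy"]) on a small, hypothetical weather data set returns a nested dictionary such as {'outlook': {'sunny': {'windy': {'no': 'play', 'yes': 'stay'}}, 'rainy': 'stay'}}, in which each inner dictionary is a decision node and each string is a leaf.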
Being analogous to naturally occurring trees, decision trees consist of
branches and nodes. Nodes are either decision-making or terminating.
Terminating nodes, or tree leaves, represent specific classes of the data.
Decision nodes define a test to be conducted on one of the data attributes
in order to split the data into subsets. The tree starts from the root and
the branches lead to the subsequent nodes. The decision tree can be
interpreted as a rule induction technique: each path from the root to a
leaf corresponds to an if-then rule, so different scenarios can be read
from the tree structure. In this respect, rule induction is similar to the
previously described fuzzy logic.
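Under the dict-based tree sketched above, this rule extraction amounts to enumerating root-to-leaf paths; the rule format below is arbitrary and purely illustrative.

    def tree_to_rules(tree, conditions=()):
        """Enumerate root-to-leaf paths of the nested-dict tree as if-then rules."""
        if not isinstance(tree, dict):            # leaf: emit the accumulated rule
            yield (" AND ".join(conditions) or "TRUE", tree)
            return
        (attr, branches), = tree.items()          # one attribute test per decision node
        for value, subtree in branches.items():
            yield from tree_to_rules(subtree, conditions + (f"{attr} = {value}",))

    # for condition, label in tree_to_rules(tree):
    #     print(f"IF {condition} THEN class = {label}")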
Different algorithms are used for decision tree construction: ID3, C4.5,
C5.0, Classification and Regression Trees (CART), Chi-squared
Automatic Interaction Detector (CHAID), etc. The most popular
algorithms, ID3, C4.5, and C5.0, were developed by Quinlan (1986,
1993). The ID3 algorithm was the first and was based on the Concept
Learning System algorithm. The ID3 algorithm searches through the
attributes of the data set and extracts the one attribute that best separates
the given samples. It then searches through the subsets created to find
their best attribute for further partitioning of the data set. There is no
backtracking once the subsets are created, so the same attribute cannot
be selected twice during tree construction. When an attribute is found to
perfectly classify the samples, the ID3 algorithm stops. One shortcoming
of the ID3 algorithm is that it deals only with discrete values of data
samples. Discretization of continuous values is therefore necessary for
successful classification, and different discretization techniques are
available.
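The "best separation" in ID3 is measured by information gain, that is, the reduction in entropy achieved by a split (Quinlan, 1986). A minimal sketch, again assuming the dict-based records used above:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def information_gain(records, attr):
        """Entropy reduction obtained by splitting the records on attr."""
        subsets = {}
        for r in records:
            subsets.setdefault(r[attr], []).append(r["class"])
        split_entropy = sum(len(s) / len(records) * entropy(s)
                            for s in subsets.values())
        return entropy([r["class"] for r in records]) - split_entropy

    # At every node, ID3 selects the attribute with the highest gain:
    # best = max(attributes, key=lambda a: information_gain(records, a))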
The C4.5 algorithm is an improved version of the ID3 algorithm in
that it can also analyze continuous attributes. The algorithm splits
continuous samples into two sets, smaller and greater than a threshold
value (Quinlan, 1996); a sketch of this threshold search is given below.
It also handles data with missing attribute values and prunes trees after
creation. In order to limit excessive partitioning of the data, tree depth
(i.e. branching) can be controlled. This is done by so-called tree pruning,
which is the process of removing branches that contribute little to
classification accuracy, thereby reducing overfitting.
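The binary threshold search can be sketched as follows, reusing the entropy helper above. Evaluating candidate cut points at midpoints between consecutive observed values is an illustrative simplification, not a claim about Quinlan's exact implementation.

    def best_threshold(records, attr):
        """Find the most informative binary cut point for a continuous attribute."""
        values = sorted({r[attr] for r in records})
        labels = [r["class"] for r in records]
        best_gain, best_t = float("-inf"), None
        for lo, hi in zip(values, values[1:]):    # consecutive distinct values
            t = (lo + hi) / 2                     # candidate threshold (midpoint)
            left = [r["class"] for r in records if r[attr] <= t]
            right = [r["class"] for r in records if r[attr] > t]
            n = len(records)
            gain = entropy(labels) - (len(left) / n * entropy(left)
                                      + len(right) / n * entropy(right))
            if gain > best_gain:
                best_gain, best_t = gain, t
        return best_t                             # None if attr has one distinct value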