Decision Tree Applications for Data Modelling (Artificial Intelligence)

INTRODUCTION

Many organisations, nowadays, have developed their own databases, in which a large amount of valuable information, e.g., customers’ personal profiles, is stored. Such information plays an important role in organisations’ development processes as it can help them gain a better understanding of customers’ needs. To effectively extract such information and identify hidden relationships, there is a need to employ intelligent techniques, for example, data mining.

Data mining is a process of knowledge discovery (Roiger & Geatz, 2003). There is a wide range of data mining techniques, one of which is decision trees. Decision trees, which can be used for classification and prediction, are a tool to support decision making (Lee et al., 2007). As a decision tree can accurately classify data and make effective predictions, it has already been employed for data analyses in many application domains. In this paper, we attempt to provide an overview of the applications that decision trees can support. In particular, we focus on business management, engineering, and health-care management.

The structure of the paper is as follows. Firstly, Section 2 provides the theoretical background of decision trees. Section 3 then moves to discuss the applications that decision trees can support, with an emphasis on business management, engineering, and health-care management. For each application, how decision trees can help identify hidden relationships is described. Subsequently, Section 4 provides a critical discussion of limitations and identifies potential directions for future research. Finally, Section 5 presents the conclusions of the paper.


BACKGROUND

Decision trees are one of the most widely used classification and prediction tools. This is probably because the knowledge discovered by a decision tree is illustrated in a hierarchical structure, which makes it easy to understand even for individuals who are not experts in data mining (Chang et al., 2007). A decision tree model can be created in several ways using existing decision tree algorithms. In order to adopt such algorithms effectively, there is a need to have a solid understanding of the processes of creating a decision tree model and to identify the suitability of the decision tree algorithms used. These issues are described in the subsections below.

Processes of Model Development

A common way to create a decision tree model is to employ a top-down, recursive, divide-and-conquer approach (Greene & Smith, 1993). Such a modelling approach places the most significant attribute at the top level as the root node and the least significant attributes at the bottom level as leaf nodes (Chien et al., 2007). Each path between the root node and a leaf node can be interpreted as an ‘if-then’ rule, which can be used for making predictions (Chien et al., 2007; Kumar & Ravi, 2007).
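The reading of each root-to-leaf path as an ‘if-then’ rule can be illustrated with a minimal sketch. The nested-dictionary tree, the attribute names (price, urgency) and the class labels below are hypothetical and are not taken from any of the cited studies.

```python
# Hypothetical toy tree: internal nodes test one attribute,
# leaves carry a class label.
toy_tree = {
    "attribute": "price",
    "branches": {
        "low": {"label": "buys"},
        "high": {
            "attribute": "urgency",
            "branches": {
                "urgent": {"label": "buys"},
                "not urgent": {"label": "does not buy"},
            },
        },
    },
}


def extract_rules(node, conditions=()):
    """Return one 'IF ... THEN ...' string per root-to-leaf path."""
    if "label" in node:
        premise = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {premise} THEN {node['label']}"]
    rules = []
    for value, child in node["branches"].items():
        step = (node["attribute"], value)
        rules.extend(extract_rules(child, conditions + (step,)))
    return rules


rules = extract_rules(toy_tree)
for rule in rules:
    print(rule)
```

The toy tree has three leaves, so the sketch yields three rules, one for each path from the root.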

To create a decision tree model on the basis of the above-mentioned approach, the modelling processes can be divided into three stages, which are: (1) tree growing, (2) tree pruning, and (3) tree selection.

Tree Growing

The initial stage of creating a decision tree model is tree growing, which includes two steps: tree merging and tree splitting. At the beginning, the non-significant predictor categories and the significant categories within a dataset are grouped together (tree merging). As the tree grows, impurities within the model will increase. Since the existence of impurities may reduce the accuracy of the model, there is a need to purify the tree. One possible way to do so is to separate the impurities into different leaves and ramifications (tree splitting) (Chang, 2007).
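The idea of impurity and of splitting to reduce it can be sketched as follows. The Gini index is used here only as one common impurity measure; it is an assumption of this sketch, not necessarily the measure used in the studies cited above, and the records are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for more mixed nodes."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return 1.0 - sum((count / total) ** 2 for count in counts)


def split_impurity(rows, attribute, target):
    """Weighted Gini impurity of the child nodes after a split."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


def best_split(rows, attributes, target):
    """Pick the attribute whose split leaves the least impurity."""
    return min(attributes, key=lambda a: split_impurity(rows, a, target))


# Hypothetical toy records.
rows = [
    {"price": "low", "urgency": "urgent", "buys": "yes"},
    {"price": "low", "urgency": "urgent", "buys": "yes"},
    {"price": "high", "urgency": "urgent", "buys": "yes"},
    {"price": "high", "urgency": "not urgent", "buys": "no"},
    {"price": "high", "urgency": "not urgent", "buys": "no"},
    {"price": "low", "urgency": "urgent", "buys": "yes"},
]

root_impurity = gini([row["buys"] for row in rows])
chosen = best_split(rows, ["price", "urgency"], "buys")
print(root_impurity, chosen)
```

In this toy dataset, splitting on urgency produces two pure child nodes, so it is preferred over splitting on price, which leaves one mixed node.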

Tree Pruning

Tree pruning, which is the key element of the second stage, removes irrelevant splitting nodes (Kirkos et al., 2007). The removal of irrelevant nodes can help reduce the chance of creating an over-fitting tree. Such a procedure is particularly useful because an over-fitting tree model may misclassify data in real-world applications (Breiman et al., 1984).
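One simple way to remove irrelevant splitting nodes is reduced-error pruning, sketched below. This technique is chosen here for brevity and is not claimed to be the pruning method of the works cited above. The tree layout, the "majority" field and the validation records are hypothetical.

```python
def predict(node, row):
    """Follow branches to a leaf; use the majority class for unseen values."""
    while "label" not in node:
        child = node["branches"].get(row[node["attribute"]])
        if child is None:
            return node["majority"]
        node = child
    return node["label"]


def errors(node, rows, target):
    """Number of rows that the (sub)tree misclassifies."""
    return sum(predict(node, row) != row[target] for row in rows)


def prune(node, rows, target):
    """Bottom-up: replace a subtree with a majority-class leaf whenever
    this does not increase the error on the validation rows."""
    if "label" in node:
        return node
    branches = {}
    for value, child in node["branches"].items():
        subset = [r for r in rows if r[node["attribute"]] == value]
        branches[value] = prune(child, subset, target)
    candidate = {
        "attribute": node["attribute"],
        "majority": node["majority"],
        "branches": branches,
    }
    leaf = {"label": node["majority"]}
    if errors(leaf, rows, target) <= errors(candidate, rows, target):
        return leaf
    return candidate


# Hypothetical tree in which the split on "colour" is irrelevant.
toy_tree = {
    "attribute": "price",
    "majority": "yes",
    "branches": {
        "low": {"label": "yes"},
        "high": {
            "attribute": "colour",
            "majority": "no",
            "branches": {"red": {"label": "yes"}, "blue": {"label": "no"}},
        },
    },
}
validation = [
    {"price": "low", "colour": "red", "buys": "yes"},
    {"price": "low", "colour": "blue", "buys": "yes"},
    {"price": "high", "colour": "red", "buys": "no"},
    {"price": "high", "colour": "blue", "buys": "no"},
]

pruned = prune(toy_tree, validation, "buys")
print(pruned)
```

On the validation records, the split on colour causes one misclassification, so it is collapsed into a single leaf, whereas the split on price is kept because removing it would increase the error.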

Tree Selection

The final stage of developing a decision tree model is tree selection. At this stage, the created decision tree model is evaluated using either cross-validation or a testing dataset (Breiman et al., 1984). This stage is essential as it can reduce the chances of misclassifying data in real-world applications and, consequently, minimise the cost of developing further applications.
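Evaluation by cross-validation can be sketched as follows. The two candidate models (a single majority-class leaf and a one-split tree), the helper names and the records are hypothetical, and the sketch assumes that the number of folds does not exceed the number of records.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_accuracy(rows, target, k, fit, predict):
    """Average accuracy on the held-out fold over k folds."""
    scores = []
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train = [r for i, r in enumerate(rows) if i not in held_out]
        test = [rows[i] for i in fold]
        model = fit(train, target)
        correct = sum(predict(model, r) == r[target] for r in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def fit_majority(train, target):
    """Candidate 1: a single leaf predicting the majority class."""
    return Counter(r[target] for r in train).most_common(1)[0][0]


def predict_majority(model, row):
    return model


def fit_stump(train, target, attribute="urgency"):
    """Candidate 2: a one-split tree on a single attribute."""
    by_value = {}
    for r in train:
        by_value.setdefault(r[attribute], []).append(r[target])
    table = {v: Counter(ls).most_common(1)[0][0] for v, ls in by_value.items()}
    return attribute, table, fit_majority(train, target)


def predict_stump(model, row):
    attribute, table, default = model
    return table.get(row[attribute], default)


# Hypothetical records in which urgency determines the outcome.
rows = [
    {"urgency": u, "buys": b}
    for _ in range(4)
    for u, b in (("urgent", "yes"), ("not urgent", "no"))
]

acc_majority = cross_validated_accuracy(rows, "buys", 4, fit_majority, predict_majority)
acc_stump = cross_validated_accuracy(rows, "buys", 4, fit_stump, predict_stump)
print(acc_majority, acc_stump)
```

The candidate with the higher cross-validated accuracy would be the one selected; in this toy case it is the one-split tree.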

Suitability of Decision Tree Algorithms

A review of existing literature shows that the most widely used decision tree algorithms include the Iterative Dichotomiser 3 (ID3) algorithm, the C4.5 algorithm, the Chi-squared Automatic Interaction Detector (CHAID) algorithm, and the Classification and Regression Tree (CART) algorithm. Amongst these algorithms, there are some differences, one of which is the capability of modelling different types of data. As a dataset may consist of different types of data, e.g., categorical data, numerical data, or a combination of both, there is a need to use a suitable decision tree algorithm which can support the particular type of data used in the dataset. All of the above-mentioned algorithms can support the modelling of categorical data whilst only the C4.5 algorithm and the CART algorithm can be used for the modelling of numerical data (see Table 1). This difference can also be used as a guideline for the selection of a suitable decision tree algorithm.

The other difference amongst these algorithms is the process of model development, especially at the stages of tree growing and tree pruning. In terms of the former, the ID3 and C4.5 algorithms split a tree model into as many ramifications as necessary whereas the CART algorithm can only support binary splits. Regarding the latter, the pruning mechanisms within the C4.5 and CART algorithms support the removal of insignificant nodes and ramifications, whereas the CHAID algorithm stops the tree growing process early, before the model overfits the training data (see Table 1).
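For reference, the splitting criteria usually associated with these algorithms can be written as follows. These are the standard textbook forms rather than formulas given in this paper, and the notation is an assumption of this sketch: S is the set of records at a node, p_c is the proportion of records in S belonging to class c, and S_v is the subset of S for which attribute A takes the value v. ID3 is usually described as maximising information gain, C4.5 the gain ratio, and CART (for classification) the reduction in the Gini index, whereas CHAID relies on chi-squared tests of independence.

```latex
\begin{align*}
H(S) &= -\sum_{c} p_c \log_2 p_c
  && \text{(entropy)} \\
\mathrm{Gain}(S, A) &= H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)
  && \text{(information gain, ID3)} \\
\mathrm{SplitInfo}(S, A) &= -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \\
\mathrm{GainRatio}(S, A) &= \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
  && \text{(gain ratio, C4.5)} \\
\mathrm{Gini}(S) &= 1 - \sum_{c} p_c^{2}
  && \text{(Gini index, CART)}
\end{align*}
```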

DECISION TREE APPLICATIONS

Business Management

In the past decades, many organisations have created their own databases to enhance their customer services. Decision trees are a possible way to extract useful information from such databases, and they have already been employed in many applications in the domain of business and management. In particular, decision tree modelling is widely used in customer relationship management and fraud detection, which are presented in the subsections below.

Customer Relationship Management

A frequently used approach to managing customer relationships is to investigate how individuals access online services. Such an investigation is mainly performed by collecting and analysing individuals’ usage data and then providing recommendations based on the extracted information. Lee et al. (2007) apply decision trees to investigate the relationships between customers’ needs and preferences and the success of online shopping. In their study, the frequency of using online shopping is used as a label to classify users into two categories: (a) users who rarely used online shopping and (b) users who frequently used online shopping. In terms of the former, the model suggests that the time customers need to spend on a transaction and how urgently customers need to purchase a product are the most important factors to be considered. With respect to the latter, the created model indicates that price and the degree of human resources involved (e.g., the need for contact with the company’s employees when receiving services) are the most important factors. The created decision trees also suggest that the success of online shopping depends heavily on the frequency of customers’ purchases and the price of the products. Such findings are useful for helping organisations understand their customers’ needs and preferences.

Fraudulent Statement Detection

Another widely used business application is the detection of Fraudulent Financial Statements (FFS). Such an application is particularly important because the existence of FFS may result in reducing the government’s tax income (Spathis et al., 2003). A traditional way to identify FFS is to employ statistical methods. However, it is difficult to discover all hidden information due to the necessity of making a huge number of assumptions and predefining the relationships among the large number of variables in a financial statement.

Previous research has shown that creating a decision tree is a possible way to address this issue as it can consider all variables during the model development process. Kirkos et al. (2007) created a decision tree model to identify and detect FFS. In their study, 76 Greek manufacturing firms were selected and their published financial statements, including balance sheets and income statements, were collected for modelling purposes. The created tree model correctly classified all non-fraud cases and 92% of the fraud cases. Such a finding indicates that decision trees can make a significant contribution to the detection of FFS owing to their high accuracy.
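Reporting accuracy separately for fraud and non-fraud cases amounts to computing a per-class accuracy. A minimal sketch is given below; the labels are hypothetical and do not reproduce the data of Kirkos et al. (2007).

```python
def per_class_accuracy(actual, predicted):
    """Fraction of correctly classified cases within each actual class."""
    totals, correct = {}, {}
    for a, p in zip(actual, predicted):
        totals[a] = totals.get(a, 0) + 1
        correct[a] = correct.get(a, 0) + (a == p)
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical labels: four non-fraud and four fraud statements.
actual = ["non-fraud"] * 4 + ["fraud"] * 4
predicted = ["non-fraud"] * 4 + ["fraud", "fraud", "fraud", "non-fraud"]

rates = per_class_accuracy(actual, predicted)
print(rates)
```

Looking at each class separately matters in fraud detection, since a single overall figure can hide poor performance on the less frequent class.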

Engineering

Another important application domain that decision trees can support is engineering. In particular, decision trees are widely used in the analysis of energy consumption and in fault diagnosis, which are described in the subsections below.

Energy Consumption

Energy consumption concerns how much electricity has been used by individuals. The investigation of energy consumption has become an important issue as it helps utility companies identify the amount of energy needed. Although many existing methods can be used for the investigation of energy consumption, decision trees appear to be preferred. This is because the hierarchical structure provided by decision trees is useful for presenting a deep level of information and insight. For instance, Tso and Yau (2007) create a decision tree model to identify the relationships between a household and its electricity consumption in Hong Kong. Findings from their tree model illustrate that the number of household members is the most determinant factor of energy consumption in summer, and that the number of air-conditioners and the size of a flat are the second most important factors. In addition, their tree model identifies that households with four or more members and a flat size larger than 817 ft² form the highest electricity consumption group. On the other hand, households with fewer than four members and without air-conditioners form the lowest electricity consumption group. Such findings from decision trees not only provide a deeper insight into the electricity consumption within an area but also give guidelines to electricity companies about when they need to generate more electricity.

Table 1. Characteristics of different decision tree algorithms

Decision tree algorithm      | Data types             | Numerical data splitting method | Possible tool
-----------------------------+------------------------+---------------------------------+------------------------------------
CHAID (Kass, 1980)           | Categorical            | N/A                             | SPSS Answer Tree (SPSS Inc, 2007)
ID3 (Quinlan, 1986)          | Categorical            | No restrictions                 | WEKA (Ian and Eibe, 2005)
C4.5 (Quinlan, 1993)         | Categorical, numerical | No restrictions                 | WEKA (Ian and Eibe, 2005)
CART (Breiman et al., 1984)  | Categorical, numerical | Binary splits                   | CART 5.0 (Salford Systems, 2004)

Fault Diagnosis

Another widely used application in the engineering domain is the detection of faults, especially the identification of a faulty bearing in rotary machinery. This is probably because a bearing is one of the most important components that directly influence the operation of a rotary machine. To detect the existence of a faulty bearing, engineers tend to measure the vibration and acoustic emission (AE) signals emanating from the rotary machine. However, the measurement involves a number of variables, some of which may be less relevant to the investigation. Decision trees are a possible tool to remove such irrelevant variables as they can be used for the purposes of feature selection. Sugumaran and Ramachandran (2007) create a decision tree model to identify the features that may significantly affect the investigation of a faulty bearing. Through feature selection, three attributes were chosen to discriminate the faulty conditions of a bearing, i.e., the minimum value of the vibration signal, the standard deviation of the vibration signal, and kurtosis. The chosen attributes were subsequently used for creating another decision tree model. Evaluations of this model show that more than 95% of the testing dataset was correctly classified. Such high accuracy suggests that the removal of insignificant attributes within a dataset is another contribution of decision trees.
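The three selected attributes are simple statistics of a sampled signal and can be computed as in the sketch below. The signal values are hypothetical. Kurtosis is taken here as the fourth standardised moment of the sample, which is an assumption, since the definition used in the cited study is not stated here; the sketch also assumes a signal with non-zero variance.

```python
import math


def vibration_features(signal):
    """Minimum, standard deviation and kurtosis of a sampled signal.

    Standard deviation and kurtosis use population moments
    (division by n). Kurtosis is the fourth standardised moment,
    so a normal distribution would give a value of about 3.
    """
    n = len(signal)
    mean = sum(signal) / n
    m2 = sum((x - mean) ** 2 for x in signal) / n  # variance
    m4 = sum((x - mean) ** 4 for x in signal) / n  # fourth central moment
    return {
        "minimum": min(signal),
        "standard_deviation": math.sqrt(m2),
        "kurtosis": m4 / m2 ** 2,
    }


# Hypothetical vibration samples.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
features = vibration_features(signal)
print(features)
```

Each measured signal would contribute one such feature vector, and these vectors would form the training records of the second decision tree model.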

Healthcare Management

As decision tree modelling can be used for making predictions, an increasing number of studies investigate the use of decision trees in health-care management. For instance, Chang (2007) has developed a decision tree model on the basis of 516 pieces of data to explore the hidden knowledge located within the medical history of developmentally-delayed children. The created model identifies that the majority of illnesses will result in delays in cognitive development, language development, and motor development, with accuracies of 77.3%, 97.8%, and 88.6% respectively. Such findings can assist healthcare professionals in making early interventions for developmentally-delayed children so as to help them catch up with their normal peers in development and growth. Another example of health-care management can be found in Delen et al. (2005). In their study, a decision tree is created to predict the survivability of breast cancer patients. The classification accuracy of their decision tree is 93.6%, which indicates that the created tree is highly accurate for predicting the survivability of breast cancer patients. These studies suggest that decision trees are a useful tool for discovering and exploring hidden information in health-care management.

FUTURE TRENDS

The application domains mentioned above demonstrate that decision trees are a very useful tool for data analyses. However, there are still many limitations that need to be recognised and addressed in future work.

Reliability of Findings

Although the decision tree is a powerful tool for data analyses, some data are still misclassified by decision tree models. A possible way to address this issue is to exploit the extracted knowledge through human-computer collaboration. In other words, experts from different domains use their domain knowledge to filter the findings from the created model. By doing so, irrelevant findings can be removed manually. However, the drawback of such a method is that it requires a large investment, as it involves the cost and time of experts from different domains.

Suitability of Algorithms

As described in Section 2.2, the development of a decision tree model involves the selection of an appropriate decision tree algorithm. In addition to taking into account the type of data being modelled, there is a need to consider the effectiveness of the algorithms. Another possible direction for future research is to compare the effectiveness of various algorithms and identify the strengths and weaknesses of each algorithm for different types of applications. In addition, it would be interesting for future research to conduct comparisons between decision tree algorithms and other types of classification algorithms. By doing so, guidelines for the selection of suitable decision tree algorithms for different types of applications can be generated.

CONCLUSION

The main objective of this paper is to help readers get an overall picture of decision trees by introducing their applications in different domains. To achieve this objective, this paper has provided an overview of the applications of decision tree modelling in the business management, engineering, and health-care management domains. In each application domain, the benefits of creating a decision tree model for the purposes of analysing data and making predictions have been identified. Such benefits include: (1) the capability to accurately discover hidden relationships between variables, (2) the presentation of knowledge with a deep level of understanding and insight on the basis of its hierarchical structure, and (3) the capability of removing insignificant attributes within a dataset.

Three application domains have been studied in this paper, but it ought to be noted that decision trees can also be applied in other application domains, e.g., bioinformatics and psychology. These application domains should also be examined. The findings of such studies can subsequently be integrated into those of this study so that a complete framework for implementing decision tree models can be created. Such a framework would be useful to enhance the depth and breadth of the knowledge of decision tree models.

KEY TERMS

Attributes: Pre-defined variables in a dataset.

Classification: An allocation of items or objects to classes or categories according to their features.

Customer Relationship Management: A dynamic process to manage the relationships between a company and its customers, including collecting, storing and analysing customers’ information.

Data Mining: Also known as knowledge discovery in databases (KDD), which is a process of knowledge discovery by analysing data and extracting information from a dataset using machine learning techniques.

Decision Tree: A predictive model which can be visualized in a hierarchical structure using leaves and ramifications.

Decision Tree Modelling: The process of creating a decision tree model.

Fault Diagnosis: An action of identifying a malfunctioning system based on observing its behaviour.

Fraud Detection Management: The detection of fraud, especially fraud in financial statements or business transactions, so as to reduce the risk of loss.

Healthcare Management: The act of preventing, treating and managing illness, including the preservation of mental and physical well-being through the services provided by health professionals.

Prediction: A statement or a claim that a particular event will happen in the future.
