Realizing Knowledge Assets in the Medical Sciences with Data Mining: An Overview

Abstract

This topic provides insight into various areas within the medical field that strive to take advantage of different data mining techniques in order to realize the full potential of their knowledge assets. Specifically, this is done by discussing many of the limitations associated with conventional methods of diagnosis and showing how data mining can be used to improve these methods. Comparative analyses of different techniques associated with various areas within the medical field are outlined in order to identify the right technique for particular medical specialties. Furthermore, suggestions are provided to appropriately utilize the various data mining techniques thereby leading to effective and efficient knowledge management and knowledge utilization. In this topic we highlight the potential of data mining in improving the exploratory as well as the predictive capabilities of conventional diagnostic methods in medical science.

Introduction

Knowledge management is an emerging business approach aimed at solving current business challenges to increase efficiency and effectiveness of core business processes while simultaneously fostering continuous creativity and innovation. Specifically, knowledge management through the use of various tools, processes and techniques combines germane organizational data, information and knowledge to create business value and enable an organization to capitalize on its intangible (e.g., knowledge) and intellectual assets so that it can effectively achieve its primary business goals as well as maximize its core business competencies (Swan et al., 1999; Davenport & Prusak, 1998). The need for knowledge management is based on a paradigm shift in the business environment where knowledge is central to organizational performance (Drucker, 1993).

Knowledge management offers organizations many tools, techniques and strategies to apply to their existing business processes. In essence, knowledge management involves not only the production of information but also the capture of data at the source, the transmission and analysis of this data, and the communication of information based on or derived from the data to those who can act on it (Swan et al., 1999). Fundamental to knowledge management is effectively integrating people, processes and technologies.

A pivotal technique in knowledge management is data mining, which is used to discover new knowledge from existing data and information and thus grow the extant knowledge asset of the organization. This is particularly relevant to health care because not only is health care a knowledge-based industry, but it is also currently experiencing exponential growth in the collection of data and information primarily due to new legislative initiatives such as Managed Care and HIPAA (Health Insurance Portability and Accountability Act) in the US. This makes it imperative for medical science to incorporate the benefits of this technique. We address this imperative by first discussing basic concepts of data mining and how they relate to the medical sciences. Next we elaborate upon key data mining techniques as well as their advantages and disadvantages and how they contribute to the building of important knowledge assets within health care.

Background to Data Mining

In the literature, data mining is generally described at two levels: a broad perspective and a narrow perspective. While the broader perspective equates data mining to the process of Knowledge Discovery in Databases (KDD), the narrow perspective sees data mining as a step within this KDD process. In either case data mining can be defined as, “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” (Frawley et al., 1992). Data mining uses machine learning, as well as statistical and visualization techniques, to discover and present knowledge in a form that is easily comprehensible to humans. Data mining involves sifting through huge amounts of data and extracting the relevant pieces of data for the particular analysis of a problem. More than just conventional data analysis (such as basic statistical methods), the technique makes heavy use of artificial intelligence. Often the emphasis is not so much on the extracting of data as on the generating of a hypothesis, as in the case of exploratory data mining. Data mining also uses sophisticated statistical analysis and modeling techniques, which allow users to find useful information such as trends and patterns hidden in their business data. Data mining is one of the latest technologies to help users deal with the abundance of data that they have collected over time. For example, this technique can help optimize business decisions, increase the value of each customer, enhance communication, and improve customer satisfaction. The retail industry has been using data mining technology to understand customer buying patterns, manage product warranties, detect fraud, and identify good credit risks. Data mining has become more popular over time for the following reasons:

1. The main reason for the popularity of various data mining techniques is the large amount of data already collected, together with newly appearing data, that requires processing beyond traditional approaches. The amount of data collected by various businesses, scientific, medical and governmental organizations around the world is enormous. It is impossible for human analysts to cope with the ever-growing and overwhelming amounts of data.

2. When a person analyses the data, he/she is liable to make errors due to the inadequacy of the human brain (i.e., the bounded rationality problem) to solve complex multifactor dependencies in the data and sometimes a lack of objectiveness in such an analysis. A human always tries to derive results based upon previous experiments and experiences gained from investigating other systems, unlike data mining which simply reflects what the data is conveying without preconceived hypotheses.

3. One more advantage of data mining is that, particularly in the case of large amounts of data, this process involves a much lower cost than hiring a team of experts. Although this technique does not discard human involvement, it significantly simplifies the job and allows an analyst who is not proficient in statistics or programming to manage the process of extracting knowledge from data (Mega Computer Intelligence).

Data Mining in Medical Sciences

The medical sciences offer a unique opportunity to apply the many techniques of data mining. This is because health care generates mountains of administrative data about patients, hospitals, utilization, claims, etc. In addition, clinical trials, electronic patient records and computer supported disease management increasingly produce large amounts of clinical data. This data, both administrative and clinical, is a strategic resource for health care institutions since it represents a raw form of their knowledge.

Data mining discovers the patterns and correlations hidden within this raw knowledge, i.e., the data repository. Furthermore, it enables health care professionals to use these patterns to aid in decision making and the establishment of revised and improved treatment protocols, and thereby enhance organizational performance.

Previous studies (Maria-Luiza et al., 2001) in various areas of the medical sciences have revealed that conventional methods of detecting symptoms or other health-related problems have been very costly and error prone. Due to the complexities and inconsistencies in these detection methods, the diagnoses which are based on the information gained from these methods can lead to outcomes that are sometimes dangerous and could even lead to a person’s death. For example, in screening for breast cancer, the main detection method available is mammography. Due to the high volume of mammograms that need to be read by physicians and the variation in the stage of potential malignancy of the tumors they show, the accuracy rate tends to decrease, and methods that focus on automatic reading of digital mammograms become highly desirable. It has been shown that double reading (by two different experts) of mammograms increases the accuracy but also naturally increases the costs. This makes it even more imperative to incorporate computer-aided diagnosis systems to assist medical professionals in achieving cost efficiency and diagnostic effectiveness, thereby enabling more appropriate and timely treatment.

In more litigious environments, the increasing risk to health care organizations and providers due to error in detection and interpretations has become extremely costly. Therefore, it is becoming a necessity to adopt new methods to facilitate not only more accurate detection and then treatment but also better preventative measures. Health care organizations have already accumulated large raw knowledge assets in the form of administrative and clinical data. What is now important for them to do is to maximize the potential of this strategic asset, hence the need for embracing data mining.

Data Mining Techniques and Their Role in Healthcare

Data mining techniques are not only used in the detection of diseases but they also are beneficial in helping to compare the different procedures required for a prognosis. For example, a physician who has newly started in practice can learn from the association of different procedures to certain diagnoses, which is the result of exploratory data mining and thus can take advantage of these findings to more effectively treat their patients, rather than depending more on the prolonged “trial and error” diagnostic path which is both more time consuming and a lower quality of care approach. The following data mining techniques are recognized for being of great benefit to many areas in business, engineering, as well as other industries. Health care should not be an exception in the application of these techniques:

• Association Rules

• Clustering

• Neural Networks

• Decision Trees

While we acknowledge that there are numerous data mining techniques, we focus on these four since, in our opinion, they are among the major techniques most suitable to the medical sciences. The first two techniques are used for exploratory data mining; the latter two are used for predictive data mining. Before describing each technique, we first outline the major steps involved in data mining in order to achieve the final goal of knowledge creation.

Knowledge Discovery Process

Figure 1 shows the knowledge discovery process, the evolution of knowledge from data through information to knowledge (Fayyad et al., 1996) and the types of data mining (exploratory and predictive) and their interrelationships. It is essential to emphasize here the importance of the interaction with the medical professionals and administrators who should always play a crucial and indispensable role in a knowledge discovery process, as depicted in Figure 1 in the interpretation step. This is particularly true when we take into consideration features that are specific to medical databases. For example, more and more medical procedures employ imaging as a preferred diagnostic tool. Thus, there is a need to develop methods for efficient mining in databases of images, which is inherently more difficult than mining in numerical databases. Other significant features include but are not limited to security and confidentiality concerns and the fact that the physician’s interpretation of images, signals, or other clinical data is written in unstructured English, which is also very difficult to mine (McGee, 1997). Some important data issues that data mining is most useful in helping organizations wrestle with include: huge volumes of data, dynamic data, incomplete data, imprecise data, noisy data, missing attribute values, redundant data, and inconsistent data.

Figure 1 also shows how data goes through the following process steps before being used for any decision-making:

• Selection: selecting the data according to some criteria, e.g., all those people who are suffering from or at risk of cardiac complications.

• Preprocessing: this is the data cleansing stage where certain unwanted information which may not be relevant or useful to the analysis is removed.

• Transformation: the data is not merely transferred but also changed using various mathematical manipulations (such as logarithmic transformations).

• Data mining: this stage is concerned with the extraction of patterns from the data. It includes choosing a data-mining algorithm, which is appropriate to discover a particular pattern in the data.

• Interpretation and evaluation: this is where human interaction and intervention are essential. Specifically, the patterns identified by the system are interpreted into knowledge by humans; redundant or irrelevant patterns are removed, while patterns deemed useful are translated into potential treatment decisions.

Figure 1. Overview of the knowledge discovery process
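The process steps listed above can be sketched as a chain of small functions. This is only an illustrative sketch: the record fields (`cardiac_risk`, `cholesterol`), the sample values, and the choice of a mean as the "pattern" are all hypothetical and stand in for whatever a real study would use.

```python
import math

# Hypothetical patient records; every field name and value here is illustrative.
records = [
    {"id": 1, "cardiac_risk": True,  "cholesterol": 240.0},
    {"id": 2, "cardiac_risk": False, "cholesterol": 180.0},
    {"id": 3, "cardiac_risk": True,  "cholesterol": None},   # missing value
    {"id": 4, "cardiac_risk": True,  "cholesterol": 310.0},
]

def select(rows):
    """Selection: keep only patients flagged as at risk of cardiac complications."""
    return [r for r in rows if r["cardiac_risk"]]

def preprocess(rows):
    """Preprocessing: drop rows whose measurement is missing (one simple cleansing rule)."""
    return [r for r in rows if r["cholesterol"] is not None]

def transform(rows):
    """Transformation: add a logarithmic transformation of the measurement."""
    return [{**r, "log_cholesterol": math.log(r["cholesterol"])} for r in rows]

def mine(rows):
    """Data mining (toy stand-in): summarize the transformed values by their mean."""
    values = [r["log_cholesterol"] for r in rows]
    return {"n": len(values), "mean_log_cholesterol": sum(values) / len(values)}

pattern = mine(transform(preprocess(select(records))))
# Interpretation and evaluation is a human step, so the pattern is only printed here.
print(pattern)
```

Keeping each step as a separate function mirrors the staged process: any stage can be replaced (for example, a different cleansing rule) without touching the others.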

Association Rule Mining

Association rules are used to discover relationships between attribute sets for a given input pattern. Such relationships do not necessarily imply causation; they are only associations. For example, an association rule derived from medical data could state that 80% of the cases that display a given symptom are diagnosed with a similar condition, which can improve diagnostic capabilities. These patterns (associations) are not easily discovered using other data mining techniques. The support of an association rule is the percentage of all cases that display both the antecedent and the consequent of the rule, while the confidence of the rule is the percentage of cases displaying the antecedent that also display the consequent. Only rules whose support and confidence exceed predetermined thresholds are considered useful. The classic algorithm used to generate these rules is the Apriori algorithm (Laura, 1990).
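As a minimal sketch of how a single rule is scored, the following computes support and confidence using their standard definitions: support is the fraction of all cases containing both antecedent and consequent, and confidence is the fraction of antecedent cases that also contain the consequent. The case data and the rule "fever implies flu" are hypothetical, and the sketch scores one rule only; it is not an implementation of Apriori.

```python
# Hypothetical cases: each set lists the findings recorded for one patient.
cases = [
    {"fever", "cough", "flu"},
    {"fever", "cough", "flu"},
    {"fever", "flu"},
    {"fever", "cough"},
    {"cough"},
]

def support(itemset, cases):
    """Fraction of all cases that contain every item in `itemset`."""
    return sum(1 for c in cases if itemset <= c) / len(cases)

def confidence(antecedent, consequent, cases):
    """Among cases containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, cases) / support(antecedent, cases)

rule_support = support({"fever", "flu"}, cases)          # 3 of the 5 cases
rule_confidence = confidence({"fever"}, {"flu"}, cases)  # 3 of the 4 fever cases
print(rule_support, rule_confidence)
```

A rule would be kept only if both values clear the thresholds chosen by the analyst.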

Advantages of Association Rule Mining

• Association rules are readily understandable.

• Association rules are best suited for categorical data analysis.

• Association rule mining is widely used in hospitals to maintain patients’ records.

• The outcomes are easy to interpret and explain and thus easy to use in the aiding of decision making.

Disadvantages of Association Rule Mining

• Association rule mining can generate too many rules, some of which are trivial.

• The association rules are not expressions of cause and effect, rather they are descriptive relationships in particular databases, so there is no formal testing to increase the predictive power of these rules.

• Insight, analysis and explanation by health care professionals are usually required to identify the new and useful rules and thereby achieve the full benefits from such association rules.

Clustering

In clustering we are trying to develop groupings that are internally homogenous, mutually exclusive and collectively exhaustive. For example, in the study and treatment of chromosomal and DNA-related problems the clustering technique is important. This technique is an exploratory data mining technique. The outcome from the clustering process can then be used as input into a decision tree or neural network (Berkhin, 2002).

The most frequently used clustering method is k-means. This is a geometrical method that places each data point according to its distance from the average location (the centroid) of the members of each cluster. Every data field is first converted to a number and these numbers are then normalized, so that the value of each field can be interpreted as the distance from the origin along the corresponding axis. The initial clusters are randomly defined and computationally refined during the clustering process. The working of the clustering technique depends on two main criteria: (1) the members of a cluster should be most similar to each other, and (2) members of any two different clusters should be most dissimilar.

In most cases clusters are mutually exclusive, but in some instances they may be overlapping, probabilistic or hierarchical in structure. In k-means a data point is assigned to the cluster which has the nearest centroid (i.e., the nearest mean). Clustering of this kind requires the data in numeric form, since distances to the centroids must be computed. The process of assigning points to clusters and recomputing the centroids continues until points stop changing clusters (i.e., until cluster hopping ceases).
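The assign-and-refine loop of k-means can be sketched in a few lines. This is a simplified illustration under stated assumptions: the data points are hypothetical and already normalized, Euclidean distance is used, and the initial centroids are taken to be the first k points rather than chosen at random, purely so the result is reproducible.

```python
import math

def kmeans(points, k, max_iter=100):
    """Basic k-means on a list of equal-length numeric tuples."""
    # Simplification: seed with the first k points instead of a random choice.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:  # no point changed cluster: converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Hypothetical, already-normalized two-dimensional measurements.
data = [(0.10, 0.20), (0.90, 0.80), (0.15, 0.10), (0.85, 0.90), (0.20, 0.15)]
labels, centers = kmeans(data, k=2)
print(labels, centers)
```

The loop stops as soon as an assignment pass leaves every point in its current cluster, which corresponds to the end of cluster hopping.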

Advantages of Clustering

• The main strength of clustering is that it is an undirected knowledge discovery technique.

• The clustering can be used as a preparatory technique for other data mining techniques such as decision trees or neural networks.

• The outcome of clustering can be visually represented and hence easily understood.

• Creating clusters reduces the complexity of the problem by subdividing the problem space into more manageable partitions.

• The more separable the data points the more effective clustering is.

Disadvantages of Using Clustering

• Clustering represents a snapshot of the data at a certain point in time and thus may not be as useful in highly dynamic situations.

• Sometimes the clusters generated may not even have a practical meaning.

• Because the analyst does not know in advance what to look for, a meaningful cluster can sometimes go unnoticed.

• Clustering can be computationally expensive.

Neural Networks

The technique of neural networks is modeled after the human brain and normally consists of many input nodes, one or more hidden (middle) layer nodes and one or more output nodes. The input and output nodes relate to each other through the hidden layer. The input layer represents the raw information that is fed into the network. The hidden layer represents a computational layer that transforms the inputs coming from the input layer into inputs to the output layer. The behavior of the output layer depends on the activity of the hidden layer where the weights between the hidden and output layers are used as a reconciliation mechanism to help minimize the difference between the actual and desired outputs.

The outcome of a neural network is improved through the minimization of an error function, namely the difference between a desired output and an actual output value. The most widely used algorithm for minimizing this error function is known as backpropagation. Each input pattern is evaluated individually and if its value exceeds a predetermined threshold, then a pre-specified rule fires (i.e., is activated) whereby its outcome is fed forward to the next layer. The firing rule is an important concept in neural networks and accounts for their high flexibility since it determines how one calculates whether a subsequent neuron (node) should fire for any given input pattern.
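The feed-forward pass and the error-minimizing weight update can be sketched for a very small network. Several choices here are assumptions made for illustration: the hard threshold of the firing rule is replaced by a smooth sigmoid activation (gradient-based backpropagation needs a differentiable function), bias terms are omitted, the error is taken as half the squared difference between actual and desired output, and the input, target and starting weights are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Feed the input forward: input layer -> hidden layer -> single output node."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One gradient-descent update for the error E = 0.5 * (output - target)^2."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output node (chain rule through the sigmoid).
    delta_out = (output - target) * output * (1.0 - output)
    # Error signals propagated back to each hidden node.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    new_w_out = [w - lr * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [
        [w - lr * d * xi for w, xi in zip(ws, x)]
        for ws, d in zip(w_hidden, delta_hidden)
    ]
    return new_w_hidden, new_w_out

def error(x, target, w_hidden, w_out):
    return 0.5 * (forward(x, w_hidden, w_out)[1] - target) ** 2

# Hypothetical input pattern, desired output, and starting weights.
x, target = [1.0, 0.0], 1.0
w_hidden = [[0.2, -0.4], [0.7, 0.1]]
w_out = [0.5, -0.3]

error_before = error(x, target, w_hidden, w_out)
for _ in range(200):
    w_hidden, w_out = backprop_step(x, target, w_hidden, w_out)
error_after = error(x, target, w_hidden, w_out)
print(error_before, error_after)
```

Repeating the update drives the actual output toward the desired output for this one pattern; a real network would cycle over many training patterns.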

The most important application of neural networks is pattern recognition. The network is trained to associate specific output patterns with input patterns. The power of neural networks comes into play in its predictive abilities, i.e., associating an input pattern that has not previously been classified with a specific output pattern. In such cases, the network will most likely give the output that corresponds to a pre-classified input pattern that is least different from the new input pattern.

Neural networks are mainly used in the medical sciences in recognizing disease types from various scans such as MRI or CT scans. Neural networks learn by example, and therefore the more examples we feed into the neural network the more accurate its predictive capabilities become. Neural networks can process a large number of medical records, each of which includes information on symptoms, diagnoses, and treatments for a particular case. The use of neural networks as a potential tool in medical science is exemplified by their use in the study of mammograms. In breast cancer detection the primary task is detection of tumorous cells in the early stages. The best probability for a successful cure of this disease is in its early detection. Therefore, the power of neural networks lies in the fact that they could be used to detect minute changes in tissue patterns (a key indicator of the existence of malignant cells) that are often difficult to detect with the human eye.

Advantages of Neural Networks

• Neural networks are good classification and prediction techniques when the results of the model are more important than the understanding of how the model works.

• Neural networks are very robust in that they can be used to model any type of relationship implied by the input patterns.

• Neural networks can easily be implemented to take advantage of the power of parallel computers with each processor simultaneously doing its own calculations.

• Neural networks are also very robust in situations where the data is noisy.

Disadvantages of Neural Networks

• The key problem with neural networks is the difficulty of explaining their outcomes. Unlike decision trees, neural networks use complex nonlinear modeling that does not produce rules, and hence it is hard to justify a decision based on them.

• Significant preprocessing and preparation of the data is required.

• Neural networks will tend to over-fit the data unless implemented carefully. This is due to the fact that the neural networks have a large number of parameters which can fit any data set arbitrarily well.

• Neural networks require extensive training time unless the problem is small.

Decision Trees

In critical decision situations, mistakes could be costly and have far reaching impacts. Thus data mining techniques are adopted in an attempt to minimize such mistakes. Decision trees split the available information in a treelike form and then arrive at a final decision by continuously refining the decision choices. The decision is usually made based on the choice between binary outcomes. For example, consider the binary decision of choosing between two methods—surgery and radiation—in the case of cancer treatments.

Decision making permeates health care but is of particular significance in the treatment of life-threatening diseases such as cancer. The decision tree then becomes a particularly powerful tool in such circumstances. Particularly in the case of cancer, early detection is critical since the disease grows rapidly and secondaries are more likely to develop in the meantime. A principal decision-making aspect is to decide quickly upon the specific treatment technique and then administer it and proceed with the delivery of care. For example, in Figure 2 we can see a simple decision tree that tries to model the underlying decision problem of which drug to administer under which circumstances/conditions. At the root (the top node), the data is split into two partitions with respect to this decision problem, where one partition reflects cases where the Na/K ratio is less than or equal to 14.6 and it is not clear which drug should be administered, while the other partition represents the cases where the Na/K ratio is greater than 14.6 and it is clear that Drug Y is the drug of choice. Partition 1 therefore needs to be further subdivided into three sub-partitions; namely high (partition 1.1), low (partition 1.2) or normal (partition 1.3) blood pressure cases. In the case of partition 1.1 we can see that the choice is narrowed to Drug A or Drug B, so further sub-partitioning is required (namely, partition 1.1.1 and partition 1.1.2) and is performed on age in order to get clear decisions. It then becomes clear that age is a deciding factor between administering Drug A or Drug B—something that could not be seen from partition 1.1 or even less obvious from partition 1.
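The tree described for Figure 2 can be read as a set of nested if-then rules, one per path from the root. The sketch below encodes only what the text states. The age cutoff and the direction of the age split are hypothetical placeholders, since the text says only that age separates Drug A from Drug B, and the leaves for the low and normal blood pressure partitions are not given, so they are returned as unspecified.

```python
def recommend_drug(na_k_ratio, blood_pressure, age, age_cutoff=50):
    """Walk the tree from the root down; each return is the leaf of one rule.

    `age_cutoff` and the direction of the age split are hypothetical:
    the text says only that age decides between Drug A and Drug B.
    """
    if na_k_ratio > 14.6:                 # root split: the clear-cut partition
        return "Drug Y"
    if blood_pressure == "high":          # partition 1.1
        return "Drug A" if age <= age_cutoff else "Drug B"  # partitions 1.1.1 / 1.1.2
    # Partitions 1.2 (low) and 1.3 (normal): the text does not give their leaves.
    return "not specified in the text"

print(recommend_drug(20.0, "normal", 40))
print(recommend_drug(10.0, "high", 40))
```

Writing the tree this way makes explicit why the results are easy to justify: every recommendation can be traced to the specific conditions that produced it.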

Advantages of Decision Trees

• The graphical representation of a decision tree makes it a convenient and user-friendly modeling technique since it becomes very easy to visually follow the appropriate decision path and thereby facilitate accurate decision making.

• The decision tree algorithms are not only very fast and efficient to implement but the results are also unambiguous and thus easy to interpret. This ease of interpretation is of even greater significance in the medical sciences because doctors must often justify their treatment decisions, such as in litigation instances.

• Decision trees can handle categorical (non-numeric) decision variables which are common in the medical sciences such as in Figure 2 where the decision variable is the drug to be administered.

Figure 2. Data mining resulting in the decision tree; each path from the tree root down represents a rule (i.e., a type of pattern). Knowledge stage: knowledge. Type of data mining: predictive.

• Decision trees can handle modeling situations where there is missing data, a situation that better mirrors practice.

• Decision trees prioritize the variables, using those with the most predictive power early in the partitioning process, hence using the most informative data first.
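One common way to rank variables by predictive power is information gain, the reduction in entropy of the outcome achieved by splitting on a variable. The text does not say which criterion is used, so treating information gain as the ranking measure is an assumption, and the cases and attribute names (`bp`, `sex`, `drug`) below are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` obtained by splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical cases; attribute names and values are illustrative only.
rows = [
    {"bp": "high",   "sex": "F", "drug": "A"},
    {"bp": "high",   "sex": "M", "drug": "A"},
    {"bp": "normal", "sex": "F", "drug": "X"},
    {"bp": "normal", "sex": "M", "drug": "X"},
]
gains = {a: information_gain(rows, a, "drug") for a in ("bp", "sex")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy data blood pressure separates the two drugs perfectly while sex tells us nothing, so blood pressure would be chosen for the first split.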

Disadvantages of Decision Trees

• When a decision variable is continuous, a categorization scheme needs to be developed first before applying the decision tree. Any weaknesses in this categorization scheme will be reflected in the outcome of the decision tree technique.

• There are dependencies between generations of splits (i.e., partition 1 impacts partitions 1.1, 1.2, etc.).

• The way in which decision trees handle numeric variables sometimes leads to loss of information due to loss of detail.

CONTRIBUTION TO KNOWLEDGE ASSETS

Irrespective of the specific data mining technique adopted, the significant common outcome of the application of all of these techniques is the generation of new knowledge. This newly created knowledge grows the extant knowledge base of the organization and thus not only adds value to its intangible assets but also increases its overall organizational value as new management techniques, such as the balanced scorecard, have demonstrated (Kaplan & Norton, 1996). In today’s knowledge-based economy sustainable strategic advantages are gained more from an organization’s knowledge assets than from its more traditional types of assets. Therefore, processes, tools and techniques that serve to grow the knowledge assets of an organization and thereby increase their value are strategic necessities to effectively compete in today’s economy.

Healthcare is noted for using leading edge medical technologies and embracing new scientific discoveries to enable better cures for diseases and better means to enable early detection of most life threatening diseases. However, the healthcare industry globally, and in the US specifically, has been extremely slow to adopt key business processes (such as knowledge management) and techniques (such as data mining) (Wickramasinghe et al., 2003; Wickramasinghe & Mills, 2001). “Despite its information-intensive nature, the healthcare industry invests only 2% of gross revenues in information technology, compared with 10% for other information-intensive industries” (Bates et al., 2003). Furthermore, “[e]ven though US medical care is the world’s most costly, its outcomes are mediocre compared with other industrial nations” (Bates et al., 2003). Therefore, making more of an investment in key business processes and techniques is a strategic imperative for the US healthcare industry if it is to achieve a premier standing with respect to high value, high quality and high accessibility of its healthcare delivery system.

In the final report compiled by the Committee on the Quality of Healthcare in America (Crossing the Quality Chasm, 2001), it was noted that improving patient care is integrally linked to providing high quality healthcare. Furthermore, in order to achieve a high quality of healthcare the committee identified six key aims — namely that healthcare should be: (1) safe: avoiding injuries to patients from the care that is intended to help them, (2) effective: providing services based on scientific knowledge to all who could benefit and refraining from providing services to those who will not benefit (i.e., avoiding under-use and overuse), (3) patient-centered: providing care that is respectful of and responsive to individual patient preferences, needs, and values and ensuring that patient values guide all clinical decisions, (4) timely: reducing waiting and sometimes harmful delays for both those receiving care and those who give care, (5) efficient: avoiding waste and (6) equitable: providing care that does not vary in quality based on personal characteristics.

Most of the poor quality connected with healthcare is related to a highly fragmented delivery system that lacks even rudimentary clinical information capabilities resulting in poorly designed care processes characterized by unnecessary duplication of services and long waiting times and delays (ibid). The development and application of sophisticated information systems is essential to address these quality issues and improve efficiency, yet healthcare delivery has been relatively untouched by the revolution of information technology, new business management processes such as knowledge management or new techniques such as data mining that are transforming so many areas of business today (Wickramasinghe et al., 2003; Wickramasinghe & Mills, 2001; Bates et al., 2003; Crossing the Quality Chasm, 2001; Wickramasinghe, 2000; Stegwee & Spil, 2001; Wickramasinghe & Silvers, 2002).

CONCLUSIONS

This topic attempted to provide a survey of the four major data mining techniques and their application to the medical science field in order to realize the full potential of the knowledge assets in healthcare. We also presented an enhanced framework of the knowledge discovery process to highlight the interrelationships between data, information and knowledge as well as between knowledge creation and the key steps in data mining. There is no single data mining technique that will be best under all circumstances in healthcare as well as other industries. However, a comparative analysis of the various techniques as they are used in medical science suggests the following:

1. Neural networks are general and flexible thus can model situations with either numeric or non-numeric data. Further, they can handle noisy data effectively. However, their major limitation is that it is difficult to understand the reasoning behind their outcomes.

2. Decision trees, on the other hand, are very intuitive to understand but are not as flexible, nor as tolerant of noisy data.

3. Clustering provides a powerful exploratory data mining technique; however, it can also be very computationally expensive. Further, it could sometimes generate clusters that are difficult to justify in practice. Clustering can be used as a first step for either decision trees or neural networks.

4. Association rules are very general in nature and their outcomes are very easy to understand, since these outcomes are made up of nested if-then rules. However, they require human insights to identify which rules are significant and useful and which are trivial.

Finally, we discussed the importance of data mining to the growing of the organizational knowledge assets and argued why this is of such significance in today’s knowledge-based economy for healthcare. Clearly, as much as there is a need for such techniques in creating knowledge-based organizations in healthcare, there is still much to be done before knowledge management enabled through the adoption of these data mining techniques is diffused en masse throughout healthcare organizations, thereby enabling these organizations to realize the full benefits of their knowledge assets. This topic then has served as an overview of how to realize the full potential of knowledge assets in the medical sciences using various key data mining techniques. There is naturally scope for future research to take a more detailed view of these techniques within the medical sciences.