Soft Statistical Decision Fusion for Distributed Medical Data on Grids

Abstract

This topic introduces decision fusion as a means of exploring information from distributed medical data. It proposes a new method of applying a soft data fusion algorithm on the grid to analyze massive data and discover meaningful and valuable information. The method could help service providers better understand and process medical data and provide high-quality services in patient diagnosis and treatment. Based on the soft data and decision fusion approach, it allows multiple physicians to be brought into a single case to identify and resolve problems, and it integrates distributed data sources so that knowledge and experience can be shared despite the limitations of geographical location.

Introduction

Healthcare service is a complex industry and one of the most critical components of modern human-oriented services. Informatics is an essential technology for health care (Dick & Steen, 1991) and has been applied to this field for as long as computers have existed. Information technology can be one of the major drivers of e-health activity, both directly and indirectly. E-health offers new opportunities to further improve the quality and safety of services because technology makes a high level of information management possible. However, it raises an important issue: how to utilize and integrate a very large amount of medical data efficiently to provide high-quality and safe services.
Grid computing has emerged to address this issue. It was first developed in the scientific community and can be used as an effective infrastructure for distributed high-performance computing and data processing (Foster, Kesselman, & Tuecke, 2001). The features of grid computing make the design of an advanced decision support system possible.
How to apply data fusion in a distributed medical decision system on the grid is still an open problem. In our previous research on this subject, the following observations were made that should guide our further work:
1. Massive data are collected in different organizations. With the explosion in database size, discovering meaningful and valuable information from different datasets on grids is still a critical issue that affects decision-making in this area. There is an urgent need for a new computation technique to help service providers process, analyze, and extract meaningful information from rapidly growing data.
2. The need for efficient, effective, and secure communication between multiple service providers for sharing clinical knowledge and experience is increasing. Traditional techniques are infeasible for analyzing large datasets that may be maintained across geographically distributed sites.
3. The need for finding an efficient way to integrate data, knowledge, and decisions from different parties is increasing.
The first two observations suggest one answer: build a grid-based system that enables the sharing of applications and data in an open, heterogeneous environment. The last observation suggests another: build a soft fusion mechanism that summarizes the contributions of the different parties, which may result in higher diagnostic accuracy and better treatment.


Related Work

There are several research groups whose work can contribute to grid-based data fusion on e-health.
We first discuss decision support on the grid from the grid community, then introduce related work on medical decision support from the health community, and finally present related work on soft data fusion together with our proposal for solving this problem.

Decision Support on the Grid

A decision support system is defined as any computer program that assists decision-makers in utilizing data and models to solve problems (Gorry & Morton, 1971; Keen & Morton, 1978; Sprague & Calson, 1980). Usually, it requires access to vast computation resources and processes a very large amount of data to make a decision. Grid computing is one approach to solving this problem. It has emerged as a paradigm with the ability to provide secure, reliable, and scalable high-speed access to distributed data resources. Compared to traditional distributed techniques, it has many advantages, such as resource sharing and high-performance services. The grid offers significant capability for the design and operation of complex decision support systems by linking together a number of geographically distributed computers (Ong et al., 2004).
A grid-based decision support system can be used for a broad range of problems, from business to utilities, industry, earth science, health care, education, and so on. Most researchers focus on simulation and visualization for specific processes such as air pollution (Mourino, Martin, Gonzalez, & Doallo, 2004), flooding crises (Hluchy et al., 2004; Benkner et al., 2003), and surgical procedures (Narayan, Corcoran-Perry, Drew, Hoyman, & Lewis, 2003; CrossGrid project), and then support decision-makers in making decisions on the basis of the simulation results.

Medical Decision Support

The term medical decision support system describes a set of computer applications that are designed to assist health service providers in clinical decision-making. Such a system can provide assessments or specifics that are selected from the knowledge base on the basis of individual patient characteristics (Hunt, Haynes, Hanna, & Smith, 1998; Delaney, Fitzmaurice, Riaz, & Hobbs, 1999). It is typically designed to integrate a medical knowledge base, patient data, and an application to produce case-specific advice.
Decision support systems have been used in health care since the 1960s. There is evidence that using a medical decision support system may increase compliance with clinical pathways and guidelines and reduce rates of inappropriate diagnostic tests (Australia’s Health Sector). It can support increased use of evidence by clinicians in direct patient care, resulting in better patient outcomes. However, the use of computerized medical decision systems is not commonplace. The results achieved have been rather limited and progress is slow (Reisman, 1996). Two identified barriers are the lack of sources of knowledge and system development (Shortliffe, 1986), and the lack of communication among the profusion of different systems (Hobbs, Delaney, Carson, & Kenkre, 1996). As many researchers say, there is a rapidly growing need to improve medical decision-making in order to reduce practice variation and preventable medical errors (Poses, Cebul, & Wigton, 1995; Bornstein & Emier, 2001; Sintchenko & Coiera, 2003) and to make such systems feasible in the real world.

Soft Data Fusion

Data fusion is the amalgamation of information from multiple sources. It can be classified as either hard fusion or soft fusion.
All data fusion efforts are initiated for use in particular research areas. It is still a “wide open field based on the difference in technology, the expectations by the users, and the kinds of problems that biologists and life scientists try to solve” (Freytag, Etzold, Goble, Schward, & Apweiler, 2003). Fusion system applications can be found in the domains of hydrological forecasting (Abrahart & See, 2002), health care and medical diagnosis (Laskey & Mahoney, 2003), and engineering (Chow, Zhu, Fischl, & Kam, 1993).
There are different approaches in the literature to fuse data. Some approaches use statistical analysis, while others use AI techniques like fuzzy logic (Chen & Luo, 1993), learning algorithms based on neural networks (Myers, Laskey, & DeJong, 1999), and Bayesian networks or uncertainty sets (Singhal & Brown, 1997) to handle the uncertainty.

Research Problem

In the healthcare industry, data are collected by many organizations that may be located in different places. Traditionally, the data are fragmented, and it is inconvenient for service providers to share experience and knowledge. Physicians would change medical decisions if they had enough “knowledge.” Assume that providers may collaborate, under some agreement, on problems of common concern. The following are some typical scenarios of example processes in e-health:
• Health service providers collaborate on the analysis of a newly discovered disease or pathogenic bacterium.
• Health service providers collaborate on estimating the patient’s state and providing appropriate treatment.
• Health service providers share the experience and knowledge with others.
• Health service providers get the support from others to make decisions for uncertain cases.
In these scenarios, collaboration across geographical locations is needed to enable the sharing of data and knowledge and then to make a decision. Greater benefits can be achieved if data integration is used rather than simply data collection (Sensmeier, 2003). Our research problems can be described as follows:
• Quality of the medical decision. Liability and reliability are two main issues of a medical decision system. Decision support tools must be carefully designed so that they are reliable and accurate (Sloane, Liberatore, & Nydick, 2002). How does one utilize grid computing to process the data? How does one use grid-based distributed data mining to discover knowledge? How does one make use of massive distributed data to improve the quality of service?
• Collaboration. How do the different parties collaborate with each other? How can patient data be obtained and transmitted over the Internet? Can data be integrated at different levels? How does one integrate the data, information, knowledge, and decisions from different organizations?
The following are our goals:
• For service providers. Share knowledge and experience to make high QoS decisions and choose the optimal treatment for the individual patient.
• For patients. Get better medical care that includes more accurate diagnosis results and better treatments.

Research Plan

In order to achieve all the requirements to make a system available, we need to integrate the network with multiple health service providers and patients. Grid computing allows flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources; it can be used as effective infrastructures for distributed high-performance computing and data processing (Foster, Kesselman, & Tuecke, 2001).
The logical architecture of the grid-based system is presented in Figure 1. The computers with service provider applications are the nodes of the grid. Health service providers communicate, exchange data, and share experience and knowledge through the grid service. Computers on the grid can be desktops, laptops, pocket PCs, cell phones, and so on. Users of this system can be doctors, specialists, and assistants. Users and computers can communicate with each other through the standard or wireless Internet. The typical data workflow is described in the following steps:
• Data on the patient’s current situation are sent to the doctor.
• The decision support system starts to analyze the data.
• The decision support system invites other doctors or systems on the grid to participate in the diagnosis.
• The decision support system collects results from the grid, fuses them, and generates a decision.
• The doctor sends the results back to the patient.
In order to perform such tasks, the decision support system has three integrated modules: grid service module, fusion service module, and user service module. The software architecture is shown in Figure 2.
Several modules work together in the system to carry out tasks. The user-friendly interface accepts tasks from users and sends responses back to them; the grid service agent module provides basic services to manage the grid, coordinate actions, and resolve resources among nodes. The fusion agent module implements the different levels of data fusion. The data analysis agent module analyzes data and makes diagnoses and decisions. Both the fusion and data analysis modules work with the medical database. Together, these components form the middleware that provides services to the user.
Figure 1. Logical architecture
Figure 2. Software architecture
To illustrate how this will work, we will first describe some key issues that are associated with grid-based distributed knowledge management technologies, as well as methodology for decision-making. We will then discuss plans to develop a feasible fusion mechanism for data, knowledge, and decision assembly.

Knowledge Management Technologies

Some factors that influence the quality of a medical decision system include the quality of the underlying knowledge base used in the system, incomplete datasets, and the conversion of knowledge into electronic form. Grid-based distributed data mining is used in this system to address these problems and thus improve the quality of decisions. We will discuss the following issues:
• Data privacy. Medical data are sensitive and proprietary (Shamos, 2004); many hospitals and organizations treat health and medical data as their own property and are not willing to share them with others. Proposed solutions to this issue include de-identification (Tracy, Dantas, & Upshur, 2004; Li & Chen, 2004; Kline, Johnson, Webb, & Runyon, 2004) and data-centered approaches (Du & Zhan, 2002; Kantarcioglu & Clifton, 2002). We propose to use both approaches to protect data privacy.
• Data preparation. One barrier to the quality of a decision is the quality of the medical data, which are often incomplete or out of date, so that crucial information is missing when the decision is made. Data preparation is therefore important for generating a high-quality decision. Conceptual reconstruction is one of the options that can be used to solve this problem (Aggarwal & Parthasarathy, 2001).
• Data communication. Security is a concern for this system. The use of the grid security infrastructure (GSI) (Welch et al., 2003) allows secure authentication and data exchange over an open network.
• Data mining methods. Using data mining in medical decisions can generate decisions of high accuracy (Kusiak, Kern, Kernstine, & Tseng, 2000). Several data mining algorithms are available for a decision-making system, including classification trees, case-based reasoning, neural networks, genetic algorithms, the fuzzy set approach, SVM, and so on. Our ongoing work is based on SVM and case-based reasoning. We will test and evaluate other algorithms in our future research.
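The de-identification idea in the data privacy item above can be illustrated with a minimal Python sketch. The field names and the set of identifying fields are hypothetical, since the source does not specify a record schema, and real de-identification involves considerably more than dropping a fixed list of fields.

```python
# Hypothetical list of identifying fields; the source defines no schema.
IDENTIFYING_FIELDS = {"name", "address", "phone", "patient_id"}


def split_record(record):
    """Split one patient record into a shareable part and an owner-only part.

    shared  -- medical values only, intended for all partners on the grid
    private -- the full record, kept by the owning organization
    """
    shared = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    private = dict(record)  # a copy of the full record stays with the owner
    return shared, private


if __name__ == "__main__":
    record = {
        "name": "Jane Doe",
        "patient_id": "P-001",
        "age": 54,
        "tumor_size_mm": 12.5,
        "diagnosis": "benign",
    }
    shared, private = split_record(record)
    print(shared)
```

Only the `shared` part would be exposed to other nodes under this sketch; the `private` part never leaves the owner.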

Grid Toolkit

Grid computing is one of the innovative distributed computational models. It can offer high-performance computing and data processing abilities for the distributed mining and extraction of knowledge from data repositories. Grid applications are used in many fields including scientific computing, environmental monitoring, geohazard modeling, and business. It can be used as an effective infrastructure for distributed computing and data mining (Foster, 2001).
Grid technology is maturing very quickly and is becoming more complete and more complex, both in the number of tools and in the variety of supported applications (Cannataro, 2001). Compared to traditional distributed techniques, it has many advantages, such as resource sharing and high-performance services. Many existing tools are designed to provide integration, resource management, data access, and processing of large datasets, and to support the knowledge discovery process.
Globus Toolkit (Globus project group) is one candidate for implementing grid management. It is a well-known grid middleware that provides grid resource management, security management, and other grid facilities. Globus Toolkit 3.0 is the first grid platform to fully support the OGSA/OGSI standard. It provides functionality to discover, share, and monitor resources. It also supports the mutual authentication of services and the protection of data (Butler, 2000). The fusion module and application module are built on top of Globus Toolkit.

Distributed Knowledge Discovery

Distributed knowledge discovery is a process that applies artificial intelligence theories and grid techniques to extract knowledge from distributed databases on the grid. Such a process can be implemented in the following steps:
1. Data preparation. The first step of this process is data preparation. Data retrieved from different databases on the grid need to be pre-processed for two main reasons:
o Medical data is sensitive. Data in this domain cannot be obtained without privileges. To protect data security, we follow two approaches: de-identification and data-centered processing. For de-identification, the medical data in every organization is divided into two parts: one part is pure medical data without any identifying personal information about patients, and it is accessible to all partners on the grid; the other part is the full dataset with private patient information, and it is accessible only to its owner. For the data-centered approach, data processing is performed on the local data source, and only the analysis results, not the data itself, are exchanged across the grid. Data and analysis results at different levels can serve as input to the corresponding level of fusion agent.
o Medical data may not be complete. To minimize the noise caused by incomplete data, pre-processing is necessary. The idea of conceptual reconstruction can be used to fill in the missing data. One refill process calculates the mean and deviation of the individual data values of each attribute; the other finds approximate patterns and calculates the mean of the patterns. Different algorithms are used to calculate and fill in the missing values.
2. Data exchange. The second step of this process is data exchange. Grid tools provide secure, high-throughput data transfer. The model is organized in layers, like many other grid-based knowledge discovery systems. The services it provides are arranged in a three-layer infrastructure, as depicted in Figure 3.
3. Data analysis. The third step is data analysis. Several methods are used to analyze data. To take advantage of the grid, different data analysis applications can be used on different machines. Statistical algorithms and data mining algorithms are trained on medical data and generate results for a given case. In addition, they are used to discover useful knowledge in the distributed database. The knowledge explored includes membership functions and fuzzy rule sets. The aim is to generate a number of fuzzy rules and membership functions by applying data mining algorithms to a collection of datasets on grids.
4. Fusion analysis. The system has the self-developing ability to analyze fusion results. Fusion logs are kept in a database, which enables the system to learn from outside resources.
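The first refill process mentioned in step 1 (per-attribute mean and deviation) can be sketched as follows. This is a deliberately simple mean fill, not conceptual reconstruction itself; the function name and the use of `None` to mark a missing value are assumptions made for illustration.

```python
from statistics import mean, pstdev


def fill_missing_with_mean(rows):
    """Fill None entries with the mean of the observed values of that attribute.

    rows is a list of equal-length lists (one list per case).
    Assumes every attribute has at least one observed value.
    Returns the filled rows and the (mean, deviation) pair of each attribute.
    """
    n_cols = len(rows[0])
    stats = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        stats.append((mean(observed), pstdev(observed)))
    filled = [
        [stats[c][0] if v is None else v for c, v in enumerate(r)]
        for r in rows
    ]
    return filled, stats


if __name__ == "__main__":
    data = [[1.0, None], [3.0, 4.0], [None, 6.0]]
    filled, stats = fill_missing_with_mean(data)
    print(filled)
    print(stats)
```

The deviation is returned alongside the mean so that a caller could, for instance, flag attributes whose observed values vary too widely for a mean fill to be trusted.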

Fusion Technologies

In real-world applications, a very large amount of data may be kept in the distributed database and can be accessed at an acceptable rate. Collaboration among many organizations is an important issue in making decisions with high accuracy. To combine data, information, and decisions from different parties, fusion is used in some research (Azuaje, Dubitzky, Black, & Adamson, 1999; Phegley, Perkins, Gupta, & Dorsey, 2002). The fusion technologies can be applied to different application domains.
Figure 3. Three-layer grid-based architecture
We propose four possible levels of fusion with interactive discussion, including different approaches to managing the fusion process. The integration of data and decisions can occur at four distinct levels: data, information, knowledge, and decision. The fusion service is based on basic grid mechanisms and is built on top of grid services. Decisions made on the basis of local data are collected and fused. This process aims to generate reliable decisions for health service providers by applying data mining algorithms and AI technologies to a collection of datasets on grids. It overcomes the disadvantages of human fusion and machine fusion: the former is limited by knowledge and subjective experience, while the latter is inflexible and relies on data excessively.
On account of the features of medical data, we propose soft fusion in our work, including fuzzy logic and simple weighting / voting. Neural network fusion and Bayesian fusion are left for future testing.
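As an illustration of the simple weighting / voting option, the sketch below sums the confidence values reported for each candidate decision and picks the decision with the largest total. Using the confidence directly as the vote weight is an assumption; the source does not state how weights are chosen.

```python
from collections import defaultdict


def weighted_vote(decisions):
    """Fuse (label, confidence) pairs by confidence-weighted voting.

    Assumes confidences lie in [0, 1] and at least one is positive.
    Returns the winning label and its share of the total weight.
    """
    totals = defaultdict(float)
    for label, confidence in decisions:
        totals[label] += confidence
    winner = max(totals, key=totals.get)
    support = totals[winner] / sum(totals.values())
    return winner, support


if __name__ == "__main__":
    votes = [
        ("malignant", 0.9),
        ("benign", 0.6),
        ("malignant", 0.7),
        ("benign", 0.5),
    ]
    print(weighted_vote(votes))
```

Setting every confidence to 1 reduces this to plain majority voting.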

Four-Level Fusion

A four-layer fusion framework is proposed for integration. It includes data fusion, information fusion, knowledge fusion, and decision fusion.
• Data fusion. This is the lowest level of fusion. The system collects data from multiple sources on the grid and provides categorized data that requires further processing and analysis to users. Users can make decisions or the system can generate decisions based on the process results of collected raw data.
• Information fusion. This is the second level of fusion. The system collects data that have common features from multiple sources on the grid and produces a more informative dataset for the users.
• Knowledge fusion. This is the third level of fusion. The system finds relevant features of data for each data source on the grid and summarizes knowledge from multiple nodes into a new knowledge base. Users can make decisions using the knowledge base.
• Decision fusion. This is the highest level of fusion. The system gathers and combines decisions coming from multiple nodes on the grid. The result is given as a system decision. In this topic, we propose a dynamic decision fusion mechanism. The decision-making process is a negotiable process. It is not only a gather-and-combine procedure, but also allows decision-makers who have different opinions to discuss the issue of concern and make a final decision.
Figure 4. Four-layer fusion

Hybrid Interactive Fusion

To the best of the authors’ knowledge, most of the current fusion systems have one feature in common: They fuse data or decisions on the basis of gathered items. They follow the simple flow — gather data, summarize, and give results. The limitation is clear; it is inflexible and relies excessively on computers.
We propose a dynamic fusion system with interactive discussion between different nodes on the grid. When the gathered decisions are not consistent, further actions are performed. If service providers are available, they are invited to join a Web conference or telephone conference to discuss the case with each other. If they are not available, the system may be required to re-draw conclusions by exchanging training datasets or by re-processing training datasets, for example by filtering out unnecessary attributes or by using different algorithms to handle incomplete data. Compared to traditional fusion systems, it offers higher reliability and diagnostic accuracy by allowing users to confer with others in order to reach consistency. This fusion process is described in Figure 5.
The dynamic fusion process simulates the human decision-making process. In the real world, such processes may involve decision-making, discussion, and re-decision-making. The dynamic fusion process is carried out as follows:
1. Systems analyze local datasets separately. Different data analysis applications, such as SVM, genetic algorithms, and neural networks, can be used at different sites. The size of the datasets varies from node to node.
2. The consistent threshold is set by the user. It specifies the required proportion of agreeing decisions and their required confidence, and thus indicates how certain we must be about the fused decision and what counts as an acceptable fusion. The threshold is written as <X, Y>, where X and Y are values between zero and one. A value of X close to 1 means high consistency and a value of Y close to 1 means high certainty; values close to 0 mean low consistency and low certainty, respectively. For example, <0.7, 0.7> means that a result is acceptable when at least 70% of the decisions are identical and their average confidence is at least 0.7. The higher the values of X and Y, the harder it is to obtain a satisfactory answer, but the more reliable the result.
Figure 5. Fusion process
3. Collect decisions from the grid and calculate the consistent parameter <x, y>. Let S be the set of decisions that share the same result and T the set of all decisions. Then
x = (size of S) / (size of T)
y = mean of the confidence values in S
For example, suppose there are eight systems in this medical group and, for a given case, six systems give the same diagnosis with confidences (0.8, 0.6, 0.5, 0.9, 1, 0.75). Then x = 6/8 = 0.75 and y = 4.55/6 ≈ 0.76.
4. If the consistent parameter falls below the consistent threshold, that is, if
x < X or y < Y
the further fusion process is activated. Otherwise, the results and their confidence values from the different resources of the grid are passed to a fusion algorithm, such as voting/weighting or a neural network, to generate the final system result.
5. The results coming from the grid and the final system result are written into the database as fusion history. Every result counts in this system because it will be used as input by the fusion algorithm and will contribute to future fusion. Systems accumulate knowledge in this way.
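Steps 2 to 4 above can be sketched in Python as follows. Two points are assumptions rather than statements from the source: S is taken to be the largest group of agreeing decisions, and the threshold test is read as "further fusion is needed when x < X or y < Y". The two dissenting decisions in the example are invented to complete the eight-system case.

```python
from collections import Counter


def consistency(decisions):
    """Compute the consistent parameter <x, y> from (label, confidence) pairs.

    x = |S| / |T| and y = mean confidence in S, where T is all decisions and
    S is assumed to be the largest group of decisions with the same label.
    """
    counts = Counter(label for label, _ in decisions)
    majority, size_s = counts.most_common(1)[0]
    confidences = [c for label, c in decisions if label == majority]
    x = size_s / len(decisions)
    y = sum(confidences) / size_s
    return x, y, majority


def needs_further_fusion(x, y, threshold):
    """Return True when <x, y> fails the user-set threshold <X, Y>."""
    big_x, big_y = threshold
    return x < big_x or y < big_y


if __name__ == "__main__":
    # Six agreeing systems from the worked example, plus two hypothetical dissenters.
    decisions = [("A", 0.8), ("A", 0.6), ("A", 0.5), ("A", 0.9),
                 ("A", 1.0), ("A", 0.75), ("B", 0.6), ("C", 0.7)]
    x, y, label = consistency(decisions)
    print(label, x, round(y, 2))
    print(needs_further_fusion(x, y, (0.7, 0.7)))
```

With the threshold <0.7, 0.7> from step 2, the worked example passes and goes straight to the fusion algorithm; raising X to 0.8 would trigger the further fusion process instead.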
In the best case, all decision-makers on the grid reach the same result in the first round of fusion, and no further fusion is needed. In the worst case, the decisions coming from different decision-makers can be very different: eight people may have eight different results for the same case. Exchange of opinions and discussion are then necessary. Some decision-makers may change their decisions after discussion, and the fusion process is repeated until a satisfactory result is reached.
Potential users of this system are doctors, physicians, specialists, and their assistants. The system can provide partial functionalities without human interactions in certain conditions. There are three types of the further fusion processes according to different types of users in the context of the grid:
• Human-to-human. The fusion process is implemented in a human-to-human environment. It occurs when the users of the systems are available and the system works as a decision assistant. The purpose of the system is to provide suggestions that help users make decisions by obtaining consistent results from the group of users. Once the system collects results from the grid, it determines the degree of consistency and compares it with the threshold. If it is lower than the consistent threshold, the system activates further fusion, including online discussion, e-mail, phone conference, and video conference. Doctors discuss the given case just as they would in a real-world situation. The system fusion decision starts again after some doctors change their decisions.
• Machine-to-machine. The fusion process is implemented among machines. It occurs when systems work automatically without human interaction. Once the system decides to carry out the further fusion process, one of two proposed methods can be used: re-analyze the data using a different training dataset; or re-analyze the data using the same training dataset, but with only some important attributes taken into account. The first method involves data exchange and low-level data fusion; the second involves data preparation and middle-level information fusion. This process may be repeated several times until satisfactory consistency is reached.
• Human-to-machine. Not all doctors are available on the grid, so the fusion process is implemented in a hybrid way. If the system needs to carry out further fusion, it has two options: invite the available doctors to discuss the case directly, as in the human-to-human situation; or have the systems re-analyze the data, either by providing part of the local dataset to remote systems or by determining the attributes to be used for further analysis. The results are then fused again.
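The choice among the three further-fusion types, and the repeat-until-consistent behavior, can be sketched as below. The function names, the callable parameters, and the `max_rounds` cap are all hypothetical; the source only says that the process may repeat several times.

```python
def choose_further_fusion(doctors_available, doctors_total):
    """Pick the further-fusion type from how many doctors are reachable."""
    if doctors_total > 0 and doctors_available == doctors_total:
        return "human-to-human"
    if doctors_available == 0:
        return "machine-to-machine"
    return "human-to-machine"


def run_fusion_rounds(collect, is_consistent, further_fusion, max_rounds=5):
    """Repeat collect -> check -> further fusion until the decisions are consistent.

    collect()           -- gathers the current decisions from the grid
    is_consistent(d)    -- compares the decisions with the consistent threshold
    further_fusion(d)   -- discussion or re-analysis that may change decisions
    max_rounds (>= 1)   -- hypothetical safety cap, not taken from the source
    Returns the last collected decisions and the number of rounds used.
    """
    for round_no in range(1, max_rounds + 1):
        decisions = collect()
        if is_consistent(decisions):
            return decisions, round_no
        further_fusion(decisions)
    return decisions, max_rounds


if __name__ == "__main__":
    print(choose_further_fusion(doctors_available=2, doctors_total=5))
```

Passing the three steps in as callables keeps the loop independent of whether the further fusion is a human discussion or an automatic re-analysis.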
The proposed fusion mechanism follows the way humans make decisions. The AI technologies make it smarter and more reliable.

Conclusion

A novel method for data fusion with interactions among decision-makers is described. This method takes advantage of observing other decision-makers’ opinions and then modifies the result. It simulates the process of human decision-making, which involves decision-making, decision fusion, discussion, and then re-making and re-fusing decisions. It can improve the reliability and flexibility of the fusion system.
The goal of this project is to develop a generalized grid-based decision-making and fusion system to improve the accuracy of diagnosis.
A system using SVM for learning from medical data and fuzzy logic for making decisions is designed. Simulations based on Wisconsin Breast Cancer database and Heart Disease (UCI ML Repository) show that the new system is effective in terms of decision accuracy. Even more promising is that higher accuracies are possible if other AI techniques are used.
