Text Mining and Patient Severity Clusters - Text Mining Techniques for Healthcare Provider Quality Determination

example, there are only diagnosis codes, and the common terms are in fact common diagnoses. Less common diagnoses are given higher weights than more common ones. Therefore, 'flu' will have a low weight while 'metastatic lung cancer' will have a much higher weight. We initially use the standard entropy weighting method, so that the most common ICD-9 codes (hypertension; disorder of lipid metabolism, or high cholesterol) are given low weights while less common codes (uncontrolled Type I diabetes with complications) are given higher weights.

Clustering was performed using the expectation maximization algorithm (Cerrito, 2007). It is a relatively new, iterative clustering technique that works well with nominal data in comparison to the more standard K-means and hierarchical methods. The clusters are identified by the terms that most clearly represent the text strings contained within the cluster. This does not mean that every patient in a cluster has the representative combination of terms; it means that the linkages among those terms are the ones with the highest identified weights.
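
To make the iterative idea concrete, the following is a minimal sketch of expectation maximization on nominal data. It assumes a mixture of independent Bernoulli features (one 0/1 column per code), which is a choice made here for illustration; the excerpt does not specify the mixture model, and the software's implementation may differ. The function name, the toy rows, and the starting values are all hypothetical.

```python
import math

def em_bernoulli_mixture(rows, init_means, iterations=50, eps=1e-6):
    """Fit a mixture of independent Bernoulli features with EM.

    rows: list of 0/1 lists (one per patient, one column per code).
    init_means: starting per-cluster probabilities for each column.
    Returns (mixing weights, cluster means, responsibilities).
    """
    k, d, n = len(init_means), len(rows[0]), len(rows)
    means = [list(m) for m in init_means]
    mix = [1.0 / k] * k
    resp = []
    for _ in range(iterations):
        # E-step: posterior probability of each cluster for each row.
        resp = []
        for x in rows:
            logp = []
            for c in range(k):
                lp = math.log(mix[c])
                for j in range(d):
                    p = min(max(means[c][j], eps), 1.0 - eps)
                    lp += math.log(p if x[j] else 1.0 - p)
                logp.append(lp)
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate mixing weights and per-cluster means.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            mix[c] = max(nc / n, eps)
            means[c] = [
                sum(r[c] * x[j] for r, x in zip(resp, rows)) / max(nc, eps)
                for j in range(d)
            ]
    return mix, means, resp

# Hypothetical toy data; columns: 682.7, 250.02, 730.17, 041.11, 042.
rows = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
start = [[0.6, 0.6, 0.6, 0.4, 0.4], [0.4, 0.4, 0.4, 0.6, 0.6]]
mix, means, resp = em_bernoulli_mixture(rows, start)
labels = [max(range(len(r)), key=lambda c: r[c]) for r in resp]
print(labels)
```

Each cluster is then described by the columns with the highest estimated probabilities, even though not every patient in the cluster carries every one of those codes.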

This chapter gives an application of text mining to the development of a patient severity index. It is not intended to give a complete discussion of how text mining is performed or of how natural language processing has enhanced the development of text mining. Instead, we refer the interested reader to several textbooks that provide detailed information on the process of text mining (Feldman & Sanger, 2006; Kao & Poteet, 2006; Weiss, Indurkhya, Zhang, & Damerau, 2004).

In this chapter, we will develop an artificial language of text strings that are composed of nouns. The nouns in a string represent patient conditions and/or procedures. The methodology that was initially developed to analyze sentences and paragraphs can then also be used on these text strings. The natural language methods take advantage of the linkage between codes that are related to any one patient. It is the combinations of codes, rather than the individual codes, that are used to define clusters of patients.

Consider one such string: 682.7, 250.02, 730.17, 681.10, 401.9, 593.9, and 285.29. The translations are Other cellulitis and abscess of foot except toe; Type II diabetes without complications; Chronic osteomyelitis of the ankle and foot; Cellulitis and abscess, unspecified; Essential hypertension; Unspecified disorder of kidney and ureter; and Anemia of other chronic illness. This patient has diabetes and long-term problems with foot ulcers, cellulitis, and infection in the bone. In addition, this patient has kidney problems with anemia. A second string is 682.7, 041.11, V09.0, and 042. This patient has Other cellulitis and abscess of foot except toe; Staphylococcus aureus; Infection with microorganisms resistant to penicillins; and Human immunodeficiency virus [HIV] disease. This patient has foot ulcers with resistant infection related to HIV rather than to diabetes.
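
As a small illustration of treating these code strings as text, the sketch below tokenizes the two strings quoted above and builds a binary patient-by-code matrix. The space-delimited storage format and the helper name tokenize are assumptions made for the example, not a description of the software used in the chapter.

```python
# The two code strings quoted in the text, one per patient
# (assumed here to be stored space-delimited).
patient_a = "682.7 250.02 730.17 681.10 401.9 593.9 285.29"
patient_b = "682.7 041.11 V09.0 042"

def tokenize(code_string):
    """Split a code string into tokens, normalizing letter case."""
    return [token.upper() for token in code_string.split()]

tokens_a = tokenize(patient_a)
tokens_b = tokenize(patient_b)

# Codes the two patients share, and codes that set them apart.
shared = sorted(set(tokens_a) & set(tokens_b))
only_a = [t for t in tokens_a if t not in tokens_b]
only_b = [t for t in tokens_b if t not in tokens_a]

# Binary patient-by-code matrix over the combined vocabulary.
vocabulary = sorted(set(tokens_a) | set(tokens_b))
matrix = [
    [1 if term in tokens else 0 for term in vocabulary]
    for tokens in (tokens_a, tokens_b)
]

print("shared:", shared)
print("only A:", only_a)
print("only B:", only_b)
```

The two patients share only 682.7; it is the remaining codes in each string, taken in combination, that separate the diabetic patient from the patient with HIV.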

Text Analysis of Diagnosis Codes

We first look at the diagnosis codes for the patients. Using all of the 3-digit codes in one text string, we can use text analysis to define a total of ten different clusters. Because of the computational time involved, we use a 1% sample of the National Inpatient Sample for 2005 and find the clusters given in Table 1. Once the clusters are defined, the scoring mechanism in SAS Enterprise Miner can be used to place all of the patients into the identified text clusters.
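
Reducing full codes to their 3-digit categories before building the text string might be sketched as follows. The helper names are hypothetical, and the handling of codes stored without a decimal point, and of E codes as a letter plus three digits, is an assumption about the input format rather than something stated in the chapter.

```python
def three_digit_category(code):
    """Reduce an ICD-9 code to its category (the part before the decimal).

    Assumption: codes without a decimal point keep their first three
    characters, or four when they start with 'E'.
    """
    code = code.strip().upper()
    if "." in code:
        return code.split(".", 1)[0]
    width = 4 if code.startswith("E") else 3
    return code[:width]

def category_string(codes):
    """Join a patient's categories into one text string, dropping repeats."""
    categories = [three_digit_category(c) for c in codes]
    return " ".join(dict.fromkeys(categories))

first_patient = ["682.7", "250.02", "730.17", "681.10", "401.9", "593.9", "285.29"]
second_patient = ["682.7", "041.11", "v09.0", "042"]

print(category_string(first_patient))
print(category_string(second_patient))
```

Working at the category level shrinks the vocabulary considerably, which matters when computation time is already the reason for sampling.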

It is clear that clusters 2 and 4 are focused primarily upon childbirth. We will first look at the relationship of these text clusters to patient outcomes. Then we can also define clusters after first eliminating all diagnoses related to childbirth, which are very specific and have little relation to patients with other diagnoses. Table 2 gives the probability of mortality by cluster. At this point, the clusters were
