example, there are only diagnosis codes, and common terms are in fact common diagnoses. It is the less
common diagnoses that are given higher weights compared to more common terms. Therefore, 'flu' will
have a low weight while 'metastatic lung cancer' will have a much higher weight. We initially use the
standard entropy weighting method, so that the most common ICD-9 codes (hypertension, disorder of
lipid metabolism or high cholesterol) will be given low weights while less common codes (uncontrolled
Type I diabetes with complications) will be given higher weights.
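The weighting idea can be illustrated with a minimal sketch. It assumes one common formulation of entropy weighting (the log-entropy global weight), in which a code spread evenly over all patients gets a weight near 0 and a code concentrated in a single patient gets a weight of 1. Whether this matches the exact formula used by the software in this chapter is an assumption, and the function name and the toy patient lists are invented for illustration.

```python
import math
from collections import Counter

def entropy_weights(documents):
    """Return {term: weight} using the log-entropy global weight.

    weight(t) = 1 + sum_d p(t, d) * log2 p(t, d) / log2(n_docs),
    where p(t, d) = (count of t in d) / (total count of t).
    """
    n_docs = len(documents)
    if n_docs < 2:
        raise ValueError("need at least two documents")
    per_doc = [Counter(doc) for doc in documents]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    weights = {}
    for term, total in totals.items():
        s = 0.0
        for counts in per_doc:
            f = counts.get(term, 0)
            if f:
                p = f / total
                s += p * math.log2(p)
        weights[term] = 1.0 + s / math.log2(n_docs)
    return weights

# Toy data (invented): each "document" is one patient's list of ICD-9 codes.
patients = [
    ["401.9", "593.9"],
    ["401.9", "250.02"],
    ["401.9", "593.9", "682.7"],
    ["401.9", "285.29"],
]
weights = entropy_weights(patients)
for code in sorted(weights, key=weights.get):
    print(f"{code}: {weights[code]:.3f}")
# 401.9 appears for every patient, so its weight is 0.000;
# 593.9 appears for half of them, so its weight is 0.500;
# codes seen for a single patient get 1.000.
```

In this sketch the weight depends only on how a code is distributed across patients, not on its clinical meaning, which is why a common code such as hypertension ends up with a low weight.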
Clustering was performed using the expectation maximization algorithm (Cerrito, 2007). It is a
relatively new, iterative clustering technique that works well with nominal data in comparison to the
more standard K-means and hierarchical methods. Each cluster is identified by the terms that
most clearly represent the text strings it contains. This does not mean that every patient in the cluster
has the representative combination of terms; it means that the linkages among those terms are the ones
that have the highest identified weights.
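To make the iterative nature of expectation maximization concrete, here is a minimal sketch that fits a mixture of independent Bernoulli variables to 0/1 indicators of which codes each patient has. This is a simplified stand-in chosen for illustration, not the procedure used in the chapter's software; the function name, the toy data, and the deterministic initialization are all assumptions.

```python
import math

def em_bernoulli_mixture(rows, k, n_iter=50, eps=1e-6):
    """Fit a k-component Bernoulli mixture to 0/1 rows with EM.

    Returns (mixing weights, per-cluster code probabilities, responsibilities).
    """
    n, d = len(rows), len(rows[0])
    # Deterministic start (an assumption for reproducibility): seed each
    # cluster from an evenly spaced row, softened away from 0 and 1.
    theta = [[0.75 if x else 0.25 for x in rows[(c * n) // k]] for c in range(k)]
    mix = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each patient.
        resp = []
        for row in rows:
            logp = []
            for c in range(k):
                lp = math.log(mix[c])
                for x, t in zip(row, theta[c]):
                    lp += math.log(t if x else 1.0 - t)
                logp.append(lp)
            m = max(logp)
            w = [math.exp(v - m) for v in logp]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: re-estimate mixing weights and code probabilities.
        for c in range(k):
            nc = max(sum(r[c] for r in resp), eps)
            mix[c] = nc / n
            for j in range(d):
                num = sum(r[c] * row[j] for r, row in zip(resp, rows))
                # Clamp so the logarithms in the E-step stay finite.
                theta[c][j] = min(1.0 - eps, max(eps, num / nc))
    return mix, theta, resp

# Toy data (invented): columns are three-digit codes, rows are patients.
codes = ["682", "250", "730", "042", "041"]
rows = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
]
mix, theta, resp = em_bernoulli_mixture(rows, k=2)
labels = [max(range(2), key=lambda c: r[c]) for r in resp]
print(labels)
for c in range(2):
    # Describe each cluster by its two most probable codes.
    top = sorted(zip(theta[c], codes), reverse=True)[:2]
    print(c, [code for _, code in top])
```

On this toy data the first three patients fall into one cluster and the last three into the other, even though not every patient carries every code that characterizes its cluster. A practical run would use random restarts rather than the fixed starting point chosen here.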
This chapter applies text mining to the development of a patient severity
index. It is not intended as a complete discussion of how text mining is performed or of how natural
language processing has enhanced the development of text mining. Instead, we refer the interested
reader to several textbooks that provide detailed information on the process of text mining (Feldman &
Sanger, 2006; Kao & Poteet, 2006; Weiss, Indurkhya, Zhang, & Damerau, 2004).
In this chapter, we will develop an artificial language of text strings that are composed of nouns. The
nouns in each string represent patient conditions and/or procedures. The methodology that was initially
developed to analyze sentences and paragraphs can then be applied to these text strings. The natural
language methods take advantage of the linkage between codes that are related to any one patient. It is
the combinations of codes, rather than the individual codes, that are used to define clusters of patients.
Consider one such string: 682.7, 250.02, 730.17, 681.10, 401.9, 593.9, and 285.29. The translations
are Other cellulitis and abscess of foot except toe; Type II diabetes without complications; Chronic
osteomyelitis of the ankle and foot; Cellulitis and abscess of toe, unspecified; Essential hypertension;
Unspecified disorder of kidney and ureter; and Anemia of other chronic illness. This patient has diabetes
and long-term problems with foot ulcers, cellulitis, and infection in the bone. In addition, this patient
has kidney problems with anemia. A second string is 682.7, 041.11, V09.0, and 042. This patient has
Other cellulitis and abscess of foot except toe, Staphylococcus aureus, Infection with microorganisms
resistant to penicillins, and Human immunodeficiency virus [HIV] disease. This patient has foot ulcers
with resistant infection related to HIV rather than to diabetes.
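As a minimal sketch of how such strings might be assembled, the following groups codes by patient and joins them into one space-separated string, so that each code acts as a "noun" token. The record layout, patient identifiers, and function names are hypothetical; only the two code lists come from the examples above.

```python
from collections import defaultdict

# Hypothetical long-format records: (patient_id, ICD-9 code).
records = [
    (1, "682.7"), (1, "250.02"), (1, "730.17"), (1, "681.10"),
    (1, "401.9"), (1, "593.9"), (1, "285.29"),
    (2, "682.7"), (2, "041.11"), (2, "V09.0"), (2, "042"),
]

def build_code_strings(records):
    """Group codes by patient and join them into one text string each."""
    by_patient = defaultdict(list)
    for patient_id, code in records:
        by_patient[patient_id].append(code)
    return {pid: " ".join(codes) for pid, codes in by_patient.items()}

def shared_codes(a, b):
    """Return the set of codes that two patient strings have in common."""
    return set(a.split()) & set(b.split())

strings = build_code_strings(records)
print(strings[1])
print(strings[2])
print(shared_codes(strings[1], strings[2]))  # {'682.7'}
```

The two patients share only 682.7, which shows why a single code is a weak basis for grouping: the remaining codes in each string are what separate a diabetic foot infection from an HIV-related one.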
Text Analysis of Diagnosis Codes
We first look at the diagnosis codes for the patients. Using all of the three-digit codes in one text string, we
can use text analysis to define a total of ten different clusters. Because of the computational time involved,
we use a 1% sample of the National Inpatient Sample for 2005 and find the clusters given in
Table 1. Once the clusters are defined, the scoring mechanism in SAS Enterprise Miner can be used to
place all of the patients into the identified text clusters.
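The preprocessing described here can be sketched as follows: reduce each code to its three-digit category, join the categories into one string per patient, and draw a 1% sample of patients for clustering. This is only an illustration outside SAS Enterprise Miner. It assumes codes are written with a decimal point (codes stored without one would need different handling), and the function names, the seed, and the patient identifiers are hypothetical.

```python
import random

def three_digit_category(code):
    """Return the part of an ICD-9 code before the decimal point."""
    return code.strip().upper().split(".")[0]

def to_category_string(codes):
    """Join the three-digit categories of one patient's codes."""
    return " ".join(three_digit_category(c) for c in codes)

def sample_patients(patient_ids, fraction=0.01, seed=0):
    """Draw a reproducible simple random sample of patient identifiers."""
    ids = list(patient_ids)
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)

print(to_category_string(
    ["682.7", "250.02", "730.17", "681.10", "401.9", "593.9", "285.29"]
))
# 682 250 730 681 401 593 285
print(to_category_string(["682.7", "041.11", "v09.0", "042"]))
# 682 041 V09 042

# Hypothetical identifiers standing in for the full set of discharges.
sampled = sample_patients(range(1000))
print(len(sampled))  # 10
```

Truncating to the category level merges closely related codes into one token, which reduces the vocabulary the clustering has to handle at the cost of some clinical detail.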
It is clear that clusters 2 and 4 are focused primarily upon childbirth. We will first look at the relationship
of these text clusters to patient outcomes. Then we can also define clusters after first eliminating
all diagnoses related to childbirth, which are very specific and not generally related to patients with
other diagnoses. Table 2 gives the probability of mortality by cluster. At this point, the clusters were