Figure 10. Decision tree following regression
tion to patients with a primary diagnosis of COPD. This is approximately 245,000 patients in the NIS
dataset. Table 3 gives the list of diagnosis codes used. The choice of these codes will be discussed in
more detail in Chapter 5. Table 4 gives the list of procedure codes taken from Chapter 3. This example
was discussed briefly in Chapter 3 for the interval outcomes of length of stay and total charges. Here,
we first examine a prediction of mortality.
If we perform standard logistic regression without stratified sampling, the false positive rate remains
small (approximately 3-4%), but the false negative rate is high (38% at its minimum). Because the dataset
is so large, almost all of the input variables are statistically significant. The percent agreement is 84%,
and the ROC curve looks fairly good (Figure 11).
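The rates quoted above can all be read off a cross-tabulation of actual against predicted outcomes. The following is a minimal sketch, assuming mortality is coded 1 and treated as the "positive" class and non-mortality is coded 0; the function name and the toy data are hypothetical and are not taken from the NIS analysis.

```python
def classification_rates(actual, predicted):
    """Return false positive rate, false negative rate, and percent agreement.

    Assumes both sequences hold 0/1 codes, with 1 = mortality (positive class).
    """
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1          # mortality correctly predicted
        elif a == 0 and p == 1:
            fp += 1          # non-mortality predicted as mortality
        elif a == 0 and p == 0:
            tn += 1          # non-mortality correctly predicted
        else:
            fn += 1          # mortality predicted as non-mortality

    total = tp + fp + tn + fn
    return {
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
        "agreement": (tp + tn) / total if total else 0.0,
    }


# Toy data only: two deaths, three survivors.
actual = [1, 1, 0, 0, 0]
predicted = [1, 0, 0, 0, 1]
rates = classification_rates(actual, predicted)
print(rates)
```

Because the false negative rate is computed over the mortality cases alone, it can be high even when overall agreement looks good, which is the pattern described for the unstratified model.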
If we perform predictive modeling and stratify the sample to the rarest level, the accuracy rate drops
to 75%, but the false negative rate is considerably improved. Figure 12 gives the ROC curve from pre-
dictive modeling. It shows that the model predicts considerably better than chance in the testing set.
We will examine the stratified sampling in more detail in the next section.
Change in Split in the Data
The analyses in the previous section assumed a 50/50 split between mortality and non-mortality. We
now look at the results when mortality makes up only 25% of the data, and then only 10%. Table 5
gives the regression classification breakdown for a 25% sample; Table 6 gives the breakdown for a 10%
sample.
Note that the ability to classify mortality accurately decreases as the proportion of mortality in the
sample decreases; almost all of the observations are classified as non-mortality, at the cost of a high
level of false positives.
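The 50/50, 25%, and 10% splits discussed in this section can be sketched as a sampling step. The example below is a sketch under stated assumptions, not the procedure used for the tables: it assumes every mortality record (coded 1) is kept and non-mortality records (coded 0) are drawn at random until mortality makes up the requested share. The function name, the seed, and the toy label counts are hypothetical.

```python
import random


def sample_to_split(labels, rare_fraction, seed=0):
    """Return sorted indices of a sample in which label 1 makes up
    approximately `rare_fraction` of the records.

    Assumption: all rare-class (1) records are kept; common-class (0)
    records are sampled without replacement.
    """
    if not 0 < rare_fraction < 1:
        raise ValueError("rare_fraction must be strictly between 0 and 1")

    rare = [i for i, y in enumerate(labels) if y == 1]
    common = [i for i, y in enumerate(labels) if y == 0]

    # Number of common-class records needed for the requested split.
    n_common = round(len(rare) * (1 - rare_fraction) / rare_fraction)
    if n_common > len(common):
        raise ValueError("not enough common-class records for this split")

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return sorted(rare + rng.sample(common, n_common))


# Toy labels only: 10 mortality records among 210.
labels = [1] * 10 + [0] * 200
for fraction in (0.50, 0.25, 0.10):
    chosen = sample_to_split(labels, fraction)
    n_rare = sum(labels[i] for i in chosen)
    print(fraction, len(chosen), n_rare)
```

With the rare class held fixed, moving from a 50/50 split to a 10% split only adds non-mortality records, so the model sees proportionally less of the outcome it is meant to detect.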