to that adherence (or the lack thereof).(Hoy et al., 2007) However, by examining the time between physician visits, or the time between prescription refills, we can get an idea of patient compliance.(Darkow et al., 2007) We can also use subsequent use of services to determine whether adverse events have occurred.(Hartung et al., 2007)
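For example, refill gaps can be computed directly from pharmacy claims. The following is a minimal SAS sketch, assuming a pharmacy claims dataset with one row per fill; the dataset and variable names (rx_claims, patid, fill_date) are hypothetical placeholders for your own data.

/* Sort fills chronologically within patient, then compute the
   number of days between successive fills as a rough compliance
   measure (assumes fill_date is a SAS date value). */
proc sort data=rx_claims;
   by patid fill_date;
run;

data refill_gaps;
   set rx_claims;
   by patid;
   prior_fill = lag(fill_date);
   if first.patid then prior_fill = .;   /* no prior fill for a patient's first record */
   gap_days = fill_date - prior_fill;    /* days between refills */
   format prior_fill date9.;
run;

Unusually long values of gap_days then suggest lapses in compliance.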
Once a claims database has been investigated, the findings should be validated against an additional data source, such as patient self-reports or a secondary database.(Setoguchi et al., 2007; Wolinsky et al., 2007) If the same database is used for a subsequent year, results will tend to be very similar. This shows that the model is reliable, in the sense that its results are reproducible; it does not, however, demonstrate validity.
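As a simple illustration, agreement between a claims-based case definition and a secondary source can be checked in SAS. This is a minimal sketch, assuming a patient-level dataset (validation) that pairs the claims-based indicator (claims_case) with an indicator from the secondary source (registry_case); all names are hypothetical.

/* Cross-tabulate the two case definitions; the AGREE option
   requests the kappa statistic as a measure of agreement. */
proc freq data=validation;
   tables claims_case*registry_case / agree;
run;

High agreement with an independent source speaks to validity, whereas similar results in a second year of the same database speak only to reliability.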
There are only a handful of papers that discuss data preprocessing. Gold and Do examine three algorithms to extract patient sub-groups from claims databases.(Gold & Do, 2007) However, the paper does not discuss the algorithms themselves, so we turn to the original papers that Gold and Do reference. Freeman et al. developed a logistic regression of diagnosis codes to predict the occurrence of a breast cancer case in the Medicare-SEER database.(Freeman, Zhang, Freeman, & Goodwin, 2000) The included codes were ICD-9 diagnosis codes 174.x, 233.0, and V10.3, together with procedure codes 85.41-85.49, 85.21-85.23, 40.3, 85.11-85.12, 87.35-87.37, 88.73, 88.85, 92.21-92.29, 99.85, and 99.25. Corresponding CPT and HCPCS codes were also used.
The study reported a specificity of 99.86% and a sensitivity of 90%. There were a total of 62 false positives, although the authors suggest that half of these were in fact recurrent breast cancers that had been diagnosed and treated outside of the SEER database. These 62 cases represent just under 2% of the identified cases, which should not significantly affect a study of patient outcomes. However, while the paper gives the algorithm, it does not provide the actual SAS code used to extract the data.
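Since the SAS code is not given, the following is only a hedged sketch of the general approach described there, patient-level indicators for the relevant code groups fed into a logistic regression, and not Freeman et al.'s actual program. The dataset and variable names (claims, gold_standard, patid, dx, px, seer_case) are hypothetical, and the published model is specified in more detail than this.

/* Flag claims carrying selected breast cancer diagnosis (dx) and
   procedure (px) codes, stored here without decimal points, as is
   common in claims files. */
data code_flags;
   set claims;
   dx_breast = (substr(dx,1,3) = '174' or dx = '2330' or upcase(dx) = 'V103');
   px_breast = (substr(px,1,3) in ('852','854'));   /* 85.2x, 85.4x */
run;

/* Roll up to one record per patient and attach a gold-standard
   SEER case indicator; patients absent from the gold standard are
   treated as non-cases. */
proc sql;
   create table patient_flags as
   select c.patid,
          max(c.dx_breast) as any_breast_dx,
          max(c.px_breast) as any_breast_px,
          max(coalesce(g.seer_case, 0)) as seer_case
   from code_flags as c
        left join gold_standard as g on c.patid = g.patid
   group by c.patid;
quit;

/* Model the probability of a confirmed case from the code flags. */
proc logistic data=patient_flags descending;
   model seer_case = any_breast_dx any_breast_px;
run;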
Nattinger et al. developed a different algorithm, using ICD-9 codes 174-174.9, 233.0, V10.3, 238.3, 239.3, 198.2, 198.81, 140-173.9, 175-195.8, 197-199.1 (excluding 198.2 and 198.81), 200-208.91, 230-234.9 (excluding 233.0 and 232.5), and 235-239.9 (excluding 238.3 and 239.3), together with procedure codes 85.1-85.19, 85.20-85.21, 85.22-85.23, 40.3, 85.33-85.48, and 92.2-92.29. This list is somewhat more expansive than that of Freeman et al., yet the study reports lower sensitivity and specificity than Freeman's. Again, the actual code is not provided with the study.(Nattinger, Laud, & Bajorunaite, 2004) These studies therefore provide little to serve as a template for extracting patient sub-groups from the datasets. Because they were created to extract breast cancer patients, these algorithms also cannot be generalized to other cancer types. Welch et al. provide a flow chart to indicate how the patients were identified, again with no code.(Welch, Fisher, Gottlieb, & Barry, 2007) Norton et al. considered all diagnoses that occurred in 0.5% or more of the population.(Garfinkel et al., 1998) However, their objective was to determine whether or not the diagnoses were complications of a surgical procedure, with the purpose of comparing the proportion of complications across hospitals as a measure of quality of care.
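Although Norton et al.'s code is likewise unavailable, a frequency screen of this kind is straightforward to express in SAS. This is a minimal sketch, assuming a claims dataset with one row per claim; the dataset and variable names (claims, patid, dx) are hypothetical.

/* Keep only diagnosis codes that appear in at least 0.5% of
   patients, in the spirit of the Norton et al. screen. */
proc sql;
   create table common_dx as
   select dx, count(distinct patid) as n_patients
   from claims
   group by dx
   having calculated n_patients
          >= 0.005 * (select count(distinct patid) from claims);
quit;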
Because so little information is given concerning data preprocessing, we are generally left to as-
sume that it has been done correctly, and that someone qualified in data preprocessing performed the
required operations and wrote the correct program code.(West et al., 2005) We will provide SAS code
throughout this text when preprocessing is needed. The code can be used as a basic template that you
can adapt to your own data.
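As a first example, the following sketch extracts a patient sub-group using inclusion and exclusion code lists, in the spirit of the Freeman and Nattinger algorithms. The dataset and variable names (claims, patid, dx) and the abbreviated code lists are placeholders to be adapted to your own data and condition of interest.

/* Flag each claim against prefix lists of inclusion and exclusion
   codes; the IN: operator matches on leading characters, so '174'
   matches any code beginning with 174. */
data flagged;
   set claims;
   include_flag = (dx in: ('174', '2330', 'V103'));
   exclude_flag = (dx in: ('140', '141', '142'));   /* extend as needed */
run;

/* Keep patients with at least one inclusion code and no
   exclusion codes. */
proc sql;
   create table cohort as
   select patid
   from flagged
   group by patid
   having max(include_flag) = 1 and max(exclude_flag) = 0;
quit;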