to that adherence (or the lack thereof).(Hoy et al., 2007) However, by examining the time between physician visits, or the time between prescription refills, we can get an idea of patient compliance.(Darkow et al., 2007) We can also use subsequent use of services to determine whether adverse events have occurred.(Hartung et al., 2007)
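For example, refill gaps can be computed directly from pharmacy claims. The following is a minimal SAS sketch, assuming a pharmacy claims dataset with one row per fill; the dataset and variable names (rx_claims, patid, fill_date) are hypothetical placeholders for your own data.

/* Sort fills chronologically within patient, then compute the
   number of days between successive fills as a rough compliance
   measure (assumes fill_date is a SAS date value). */
proc sort data=rx_claims;
   by patid fill_date;
run;

data refill_gaps;
   set rx_claims;
   by patid;
   prior_fill = lag(fill_date);
   if first.patid then prior_fill = .;   /* no prior fill for a patient's first record */
   gap_days = fill_date - prior_fill;    /* days between refills */
   format prior_fill date9.;
run;

Unusually long values of gap_days then suggest lapses in compliance.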
Once a claims database has been investigated, the findings should be validated against an additional data source, such as patient self-reports or a secondary database.(Setoguchi et al., 2007; Wolinsky et al., 2007) If the same database is used for a subsequent year, results will tend to be very similar. This shows that the model is reliable, in the sense that its results are reproducible; it does not, however, demonstrate validity.
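As a simple illustration, agreement between a claims-based case definition and a secondary source can be checked in SAS. This is a minimal sketch, assuming a patient-level dataset (validation) that pairs the claims-based indicator (claims_case) with an indicator from the secondary source (registry_case); all names are hypothetical.

/* Cross-tabulate the two case definitions; the AGREE option
   requests the kappa statistic as a measure of agreement. */
proc freq data=validation;
   tables claims_case*registry_case / agree;
run;

High agreement with an independent source speaks to validity, whereas similar results in a second year of the same database speak only to reliability.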
There are only a handful of papers that discuss data preprocessing. Gold and Do examine three algorithms to extract patient sub-groups from claims databases.(Gold & Do, 2007) However, the paper does not discuss the algorithms themselves, so we turn to the original papers that Gold and Do reference. Freeman et al. developed a logistic regression of diagnosis codes to predict the occurrence of a breast cancer case in the Medicare-SEER database.(Freeman, Zhang, Freeman, & Goodwin, 2000) The included codes were ICD-9 diagnosis codes 174.x, 233.0, and V10.3, together with procedure codes 85.41-85.49, 85.21-85.23, 40.3, 85.11-85.12, 87.35-87.37, 88.73, 88.85, 92.21-92.29, 99.85, and 99.25. Corresponding CPT and HCPCS codes were also used.
The study reported a specificity of 99.86% and a sensitivity of 90%. There were a total of 62 false positives, although the authors suggest that half of these were in fact recurrent breast cancers that had been diagnosed and treated outside of the SEER database. These 62 cases represent just under 2% of the identified cases, which should not significantly affect a study of patient outcomes. However, while the paper gives the algorithm, it does not provide the actual SAS code used to extract the data.
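Since the SAS code is not given, the following is only a hedged sketch of the general approach described there, patient-level indicators for the relevant code groups fed into a logistic regression, and not Freeman et al.'s actual program. The dataset and variable names (claims, gold_standard, patid, dx, px, seer_case) are hypothetical, and the published model is specified in more detail than this.

/* Flag claims carrying selected breast cancer diagnosis (dx) and
   procedure (px) codes, stored here without decimal points, as is
   common in claims files. */
data code_flags;
   set claims;
   dx_breast = (substr(dx,1,3) = '174' or dx = '2330' or upcase(dx) = 'V103');
   px_breast = (substr(px,1,3) in ('852','854'));   /* 85.2x, 85.4x */
run;

/* Roll up to one record per patient and attach a gold-standard
   SEER case indicator; patients absent from the gold standard are
   treated as non-cases. */
proc sql;
   create table patient_flags as
   select c.patid,
          max(c.dx_breast) as any_breast_dx,
          max(c.px_breast) as any_breast_px,
          max(coalesce(g.seer_case, 0)) as seer_case
   from code_flags as c
        left join gold_standard as g on c.patid = g.patid
   group by c.patid;
quit;

/* Model the probability of a confirmed case from the code flags. */
proc logistic data=patient_flags descending;
   model seer_case = any_breast_dx any_breast_px;
run;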
Nattinger et al. developed a different algorithm, using ICD-9 codes 174-174.9, 233.0, V10.3, 238.3, 239.3, 198.2, 198.81, 140-173.9, 175-195.8, 197-199.1 (excluding 198.2 and 198.81), 200-208.91, 230-234.9 (excluding 233.0 and 232.5), and 235-239.9 (excluding 238.3 and 239.3), together with procedure codes 85.1-85.19, 85.20-85.21, 85.22-85.23, 40.3, 85.33-85.48, and 92.2-92.29. This list is somewhat more expansive than that of Freeman et al., yet the study reports lower sensitivity and specificity than Freeman's. Again, the actual code is not provided with the study.(Nattinger, Laud, & Bajorunaite, 2004) These studies therefore provide little to serve as a template for extracting patient sub-groups from the datasets. Because they were created to extract breast cancer patients, these algorithms also cannot be generalized to other cancer types. Welch et al. provide a flow chart to indicate how the patients were identified, again with no code.(Welch, Fisher, Gottlieb, & Barry, 2007) Norton et al. considered all diagnoses that occurred in 0.5% or more of the population.(Garfinkel et al., 1998) However, their objective was to determine whether or not the diagnoses were complications of a surgical procedure, with the purpose of comparing the proportion of complications across hospitals as a measure of quality of care.
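Although Norton et al.'s code is likewise unavailable, a frequency screen of this kind is straightforward to express in SAS. This is a minimal sketch, assuming a claims dataset with one row per claim; the dataset and variable names (claims, patid, dx) are hypothetical.

/* Keep only diagnosis codes that appear in at least 0.5% of
   patients, in the spirit of the Norton et al. screen. */
proc sql;
   create table common_dx as
   select dx, count(distinct patid) as n_patients
   from claims
   group by dx
   having calculated n_patients
          >= 0.005 * (select count(distinct patid) from claims);
quit;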
Because so little information is given concerning data preprocessing, we are generally left to as-
sume that it has been done correctly, and that someone qualified in data preprocessing performed the
required operations and wrote the correct program code.(West et al., 2005) We will provide SAS code
throughout this text when preprocessing is needed. The code can be used as a basic template that you
can adapt to your own data.
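As a first example, the following sketch extracts a patient sub-group using inclusion and exclusion code lists, in the spirit of the Freeman and Nattinger algorithms. The dataset and variable names (claims, patid, dx) and the abbreviated code lists are placeholders to be adapted to your own data and condition of interest.

/* Flag each claim against prefix lists of inclusion and exclusion
   codes; the IN: operator matches on leading characters, so '174'
   matches any code beginning with 174. */
data flagged;
   set claims;
   include_flag = (dx in: ('174', '2330', 'V103'));
   exclude_flag = (dx in: ('140', '141', '142'));   /* extend as needed */
run;

/* Keep patients with at least one inclusion code and no
   exclusion codes. */
proc sql;
   create table cohort as
   select patid
   from flagged
   group by patid
   having max(include_flag) = 1 and max(exclude_flag) = 0;
quit;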