human experts. This ad hoc review process involves the identification of individuals who have relevant expertise to determine whether the results produced by a system are meaningful. Meaningfulness can be specified either as a binary decision (e.g., yes/no) or along a graded scale (e.g., 1 [best] to 5 [worst]). To account for inherent biases that may be introduced as a consequence of human subjectivity, it is considered good practice to have two or more reviewers. The relative agreement between experts can then be quantified using a statistical test, such as Cohen's Kappa [42] (best for cases where there are two experts) or Fleiss' Kappa [43] (which works for scenarios that involve more than two experts). In cases where exhaustive evaluation of the results may be intractable, it is common practice to analyze a statistically significant sample (which may be determined either as a defined proportion of the entire result set or as an even sampling of the result types to be evaluated). The main advantage of ad hoc review evaluation is that it does not require a priori knowledge of what might constitute a meaningful result. On the other hand, the challenge of determining the value of an ad hoc review is that it rests on a largely subjective determination and can be biased by the expertise of the reviewers.
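To make the agreement calculation concrete, the sketch below computes Cohen's Kappa for two reviewers from first principles; the function name and the yes/no judgment lists are hypothetical illustrations, and a real analysis would typically rely on a vetted statistics library.

```python
# A minimal sketch of quantifying two-reviewer agreement with Cohen's Kappa.
# The function and the yes/no judgments below are hypothetical illustrations.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labeling the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters agree.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance that two independent raters pick the same
    # label, based on each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_exp = sum((freq_a[lbl] / n) * (freq_b[lbl] / n)
                for lbl in set(rater_a) | set(rater_b))
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical binary meaningfulness judgments from two reviewers.
reviewer_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
reviewer_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes"]
print(f"Cohen's Kappa: {cohens_kappa(reviewer_1, reviewer_2):.2f}")  # ~0.47
```

Kappa discounts the raw agreement rate by the agreement expected from chance alone, so values near 0 indicate chance-level agreement and values near 1 indicate strong agreement.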
The use of a “gold standard” provides an objective benchmark against which results from a knowledge discovery system can be compared. A gold standard is made up of verified results that are to be expected from an accurate knowledge discovery system. Results from a system are then categorized as either: (1) True Positive (TP) - those results that match an expected result; (2) False Positive (FP) - those results that are reported as relevant by a system but not found in the gold standard; (3) True Negative (TN) - those results that are neither expected nor reported by the system; and (4) False Negative (FN) - those results that are expected but not reported by the system. Building on these categorizations, additional statistics are used to quantify the performance of the system relative to the gold standard. The two most commonly used are: (1) Sensitivity (Sn) - which assesses the system's ability to detect expected results, calculated as TP/(TP + FN); and (2) Specificity (Sp) - which assesses the system's ability to correctly exclude unexpected results, calculated as TN/(TN + FP). Statistically, Sn and Sp are complements of the Type II error rate (failure to reject a false null hypothesis, i.e., a false negative) and the Type I error rate (incorrect rejection of a true null hypothesis, i.e., a false positive), respectively.
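As a concrete sketch of this categorization and the Sn/Sp calculations, assuming the system output, the gold standard, and the full space of candidate results can each be represented as a set (all names and example values below are hypothetical):

```python
# A minimal sketch of benchmarking system output against a gold standard.
# The set representation and all example values below are hypothetical.

def benchmark(reported, gold, universe):
    """Categorize results and compute Sensitivity (Sn) and Specificity (Sp).

    `universe` is the full space of candidate results; it is needed only
    to count true negatives, which in practice is often hard to enumerate.
    """
    tp = len(reported & gold)              # reported and expected
    fp = len(reported - gold)              # reported but not expected
    fn = len(gold - reported)              # expected but not reported
    tn = len(universe - reported - gold)   # neither reported nor expected
    sn = tp / (tp + fn)                    # ability to detect expected results
    sp = tn / (tn + fp)                    # ability to exclude unexpected ones
    return sn, sp

universe = set(range(10))    # hypothetical space of all candidate results
gold = {0, 1, 2, 3}          # verified results an accurate system should find
reported = {0, 1, 2, 5}      # results actually returned by the system
sn, sp = benchmark(reported, gold, universe)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")   # Sn = 0.75, Sp = 0.83
```

The explicit `universe` argument makes the dependency visible: TP, FP, FN, and hence Sn need only the system output and the gold standard, whereas Sp cannot be computed without enumerating everything that should not be reported, a limitation taken up below.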
Benchmarking relative to a gold standard offers an objective assessment of system performance; however, the challenges with gold standards relate to the completeness and appropriateness of a given gold standard for a given context. Because a gold standard may not actually be complete or contain all possible solutions, it may also be referred to as a “reference standard.” Another related shortcoming of gold or reference standards is the general inability to completely enumerate what should be a true negative for a given system. This is especially the case in the context of bibliome mining. Of course, the ideal situation is one where the complete set of results can be compared to a gold standard and then evaluated according to Sensitivity and Specificity. However, the reality is that it is often difficult to determine the true negative rate or even completely specify what
should not be expected within a gold standard. To address this, an additional statistic is used, called the Positive Predictive Value (PPV) or Precision (Pr), which assesses the proportion of results reported by the system that are correct, calculated as TP/(TP + FP). Because PPV does not depend on TN, it can be computed even when the true negatives cannot be enumerated.
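A minimal sketch of this statistic under the same hypothetical set representation used above; unlike Specificity, it never touches the (possibly unenumerable) true negatives:

```python
# A minimal sketch of Positive Predictive Value (Precision), reusing the
# hypothetical sets from the previous example; note that it requires no
# enumeration of true negatives.

def ppv(reported, gold):
    """PPV / Precision: fraction of reported results that are correct."""
    tp = len(reported & gold)
    fp = len(reported - gold)
    return tp / (tp + fp)

print(f"PPV = {ppv({0, 1, 2, 5}, {0, 1, 2, 3}):.2f}")  # 3/4 = 0.75
```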