human experts. This ad hoc review process involves the identification of individuals who have relevant expertise to determine whether the results produced by a system are meaningful. Meaningfulness can be specified either as a binary decision (e.g., yes/no) or along a graded scale (e.g., 1 [best] to 5 [worst]). To account for inherent biases that may be introduced as a consequence of human subjectivity, it is considered good practice to have two or more reviewers. The relative agreement between experts can then be quantified using a statistical test, such as Cohen's Kappa [42] (best for cases where there are two experts) or Fleiss' Kappa [43] (which works for scenarios that involve more than two experts). In cases where exhaustive evaluation of the results may be intractable, it is common practice to analyze a statistically significant sample (which may be determined either as a defined proportion of the entire result set or as an even sampling of the result types to be evaluated). The main advantage of ad hoc review evaluation is that it does not require a priori knowledge of what might constitute a meaningful result. On the other hand, the challenge of determining the value of an ad hoc review is that it rests on a largely subjective determination and can be biased by the expertise of the reviewers.
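To make the agreement calculation concrete, the sketch below computes Cohen's Kappa for two reviewers from first principles; the function name and the yes/no judgment lists are hypothetical illustrations, and a real analysis would typically rely on a vetted statistics library.

```python
# A minimal sketch of quantifying two-reviewer agreement with Cohen's Kappa.
# The function and the yes/no judgments below are hypothetical illustrations.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labeling the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters agree.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance that two independent raters pick the same
    # label, based on each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_exp = sum((freq_a[lbl] / n) * (freq_b[lbl] / n)
                for lbl in set(rater_a) | set(rater_b))
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical binary meaningfulness judgments from two reviewers.
reviewer_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
reviewer_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes"]
print(f"Cohen's Kappa: {cohens_kappa(reviewer_1, reviewer_2):.2f}")  # ~0.47
```

Kappa discounts the raw agreement rate by the agreement expected from chance alone, so values near 0 indicate chance-level agreement and values near 1 indicate strong agreement.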
The use of a “gold standard” provides an objective benchmark against which results from a knowledge discovery system can be compared. A gold standard is made up of verified results that are to be expected from an accurate knowledge discovery system. Results from a system are then categorized as either: (1) True Positive (TP) - those results that match an expected result; (2) False Positive (FP) - those results that are reported as relevant by a system but not found in the gold standard; (3) True Negative (TN) - those results that are neither expected nor reported by the system; and (4) False Negative (FN) - those results that are expected but not reported by the system. Building on these categorizations, additional statistics are used to quantify the performance of the system relative to the gold standard. The two most commonly used are: (1) Sensitivity (Sn) - which assesses the system's ability to detect expected results, calculated as TP/(TP + FN); and (2) Specificity (Sp) - which assesses the system's ability to correctly exclude unexpected results, calculated as TN/(TN + FP). Statistically, Sn and Sp are complements of the Type II error rate (failure to reject a false null hypothesis, i.e., a false negative) and the Type I error rate (incorrect rejection of a true null hypothesis, i.e., a false positive), respectively.
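As a concrete sketch of this categorization and the Sn/Sp calculations, assuming the system output, the gold standard, and the full space of candidate results can each be represented as a set (all names and example values below are hypothetical):

```python
# A minimal sketch of benchmarking system output against a gold standard.
# The set representation and all example values below are hypothetical.

def benchmark(reported, gold, universe):
    """Categorize results and compute Sensitivity (Sn) and Specificity (Sp).

    `universe` is the full space of candidate results; it is needed only
    to count true negatives, which in practice is often hard to enumerate.
    """
    tp = len(reported & gold)              # reported and expected
    fp = len(reported - gold)              # reported but not expected
    fn = len(gold - reported)              # expected but not reported
    tn = len(universe - reported - gold)   # neither reported nor expected
    sn = tp / (tp + fn)                    # ability to detect expected results
    sp = tn / (tn + fp)                    # ability to exclude unexpected ones
    return sn, sp

universe = set(range(10))    # hypothetical space of all candidate results
gold = {0, 1, 2, 3}          # verified results an accurate system should find
reported = {0, 1, 2, 5}      # results actually returned by the system
sn, sp = benchmark(reported, gold, universe)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")   # Sn = 0.75, Sp = 0.83
```

The explicit `universe` argument makes the dependency visible: TP, FP, FN, and hence Sn need only the system output and the gold standard, whereas Sp cannot be computed without enumerating everything that should not be reported, a limitation taken up below.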
Benchmarking relative to a gold standard offers an objective assessment of system performance; however, the challenges with gold standards relate to the completeness and appropriateness of a given gold standard for a given context. Because a gold standard may not actually be complete or contain all possible solutions, it may also be referred to as a “reference standard.” Another related shortcoming of gold or reference standards is the general inability to completely enumerate what should be a true negative for a given system. This is especially the case in the context of bibliome mining. Of course, the ideal situation is one where the complete set of results can be compared to a gold standard and then evaluated according to Sensitivity and Specificity. However, the reality is that it is often difficult to determine the true negative rate or even completely specify what
should not be expected within a gold standard. To address this, an additional statistic is used, called the Positive Predictive Value (PPV) or Precision (Pr), which assesses the proportion of results reported by the system that are correct, calculated as TP/(TP + FP). Because PPV does not depend on TN, it can be computed even when the true negatives cannot be enumerated.
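A minimal sketch of this statistic under the same hypothetical set representation used above; unlike Specificity, it never touches the (possibly unenumerable) true negatives:

```python
# A minimal sketch of Positive Predictive Value (Precision), reusing the
# hypothetical sets from the previous example; note that it requires no
# enumeration of true negatives.

def ppv(reported, gold):
    """PPV / Precision: fraction of reported results that are correct."""
    tp = len(reported & gold)
    fp = len(reported - gold)
    return tp / (tp + fp)

print(f"PPV = {ppv({0, 1, 2, 5}, {0, 1, 2, 3}):.2f}")  # 3/4 = 0.75
```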