Weighted frequency approaches are generally more reliable, since they account for
the relative importance of a given relationship within a particular context.
A common weighted frequency approach is TF*IDF, where the Term Frequency (TF; the frequency with which a given term or relationship occurs in a given document) is normalized according to the Inverse Document Frequency (IDF; the inverse of the frequency with which the term or relationship occurs across the entire collection of documents).
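The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the SMART system's implementation; the function name, the length-normalized TF, the logarithmic IDF, and the toy corpus are all illustrative choices.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF*IDF weights for every term in every document.

    `documents` is a list of token lists. TF is normalized by document
    length; IDF is the log of (total documents / documents containing
    the term), so terms appearing everywhere receive a weight of zero.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    ["gene", "disease", "association"],
    ["disease", "treatment"],
    ["gene", "expression", "gene"],
]
w = tf_idf(docs)
```

Note that "disease", which occurs in two of the three documents, receives a lower weight than "treatment", which is unique to one document: rarity across the collection signals discriminative importance.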
TF*IDF was first demonstrated by Salton in his System for Mechanical Analysis and Retrieval of Text (SMART) information retrieval system [39]. The incorporation of TF*IDF into the SMART system was the first demonstration of the potential utility of modeling techniques for representing concepts and their relative relationships to each other. In formal terminology, TF*IDF is an example of a "Vector Space Modeling" technique, which enables one to identify the closeness of two potentially related concepts in a mathematical (algebraic) space. For the SMART system, the concepts of interest were the documents themselves, with the goal of identifying documents potentially related to one another. This can be generalized to identify potential relationships between concepts based on their relative relationships to one another. Recent studies have also shown how genetic information can be incorporated within a vector space modeling approach to identify potential relatedness between concepts (e.g., between genetically related diseases [40] or between potential medicinal plants and therapeutic applications [41]).
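The "closeness in an algebraic space" mentioned above is typically measured with cosine similarity between term-weight vectors. The following sketch uses sparse vectors represented as dictionaries; the variable names and the example weights are invented for illustration.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors (dicts).

    Returns 1.0 for identical directions, 0.0 for vectors sharing no terms.
    """
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical TF*IDF weight vectors for three "concepts" (documents)
doc1 = {"gene": 0.8, "disease": 0.5}
doc2 = {"gene": 0.6, "treatment": 0.7}
doc3 = {"plant": 0.9, "therapy": 0.4}
```

Here doc1 is closer to doc2 (they share the term "gene") than to doc3 (no shared terms), which is exactly the intuition behind using the vector space to surface potentially related concepts.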
The development of modeling techniques for discerning potential relationships between concepts of interest continues to be an active area of research. Computational approaches are still needed, and must continue to be enhanced, to accommodate not only the heterogeneity of how data are represented across the plethora of potential knowledge sources, but also the volume of data being generated by a highly accelerated data generation process. These challenges contribute to those seen more generally in leveraging big data for the purposes of real-time knowledge generation (as further described in Chap. 7).
5.2.5 Plausibility of Discovered Knowledge: Evaluation
The mere development of approaches for the identification of potentially important concepts or relationships through modeling techniques is of little value if one cannot ascertain the value of the identified knowledge. Rigorous evaluation is thus essential for establishing trust in knowledge discovery systems. Evaluation of bibliome mining can be done in one of two ways: (1) ad hoc review by experts; or (2) comparison against a pre-defined gold standard. There are relative merits and challenges to each approach, but the general principle is that the results of an algorithm need to be quantified in some way so that one can ascertain the reliability of its predictions.
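For the gold-standard approach, the quantification usually takes the form of precision and recall over the set of predicted relationships. A minimal sketch, in which the helper function and the aspirin-related concept pairs are purely hypothetical examples:

```python
def precision_recall(predicted, gold_standard):
    """Quantify predicted relationships against a pre-defined gold standard.

    Precision: what fraction of predictions are in the gold standard?
    Recall: what fraction of the gold standard was predicted?
    """
    predicted, gold_standard = set(predicted), set(gold_standard)
    true_positives = predicted & gold_standard
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(gold_standard) if gold_standard else 0.0
    return precision, recall

predicted = {("aspirin", "headache"), ("aspirin", "fever"),
             ("aspirin", "cancer")}
gold = {("aspirin", "headache"), ("aspirin", "fever"),
        ("aspirin", "inflammation")}
p, r = precision_recall(predicted, gold)
```

In this toy case two of the three predictions appear in the gold standard (precision 2/3), and two of the three gold-standard pairs were recovered (recall 2/3); the trade-off between the two measures is often what distinguishes competing discovery algorithms.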
Since the overall principle behind developing computational approaches for identifying new knowledge is to reflect human intuition for discovering new knowledge, a common approach to evaluation involves the leveraging of