Weighted frequency approaches are generally more reliable, since they account for
the relative importance of a given relationship within a particular context.
A common weighted frequency approach is TF*IDF, where the Term Frequency (TF; the frequency with which a given term or relationship occurs in a given document) is normalized according to the Inverse Document Frequency (IDF; the inverse of the frequency with which the term or relationship occurs across the entire collection of documents).
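The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the SMART system's implementation; the function name, the length-normalized TF, the logarithmic IDF, and the toy corpus are all illustrative choices.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF*IDF weights for every term in every document.

    `documents` is a list of token lists. TF is normalized by document
    length; IDF is the log of (total documents / documents containing
    the term), so terms appearing everywhere receive a weight of zero.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    ["gene", "disease", "association"],
    ["disease", "treatment"],
    ["gene", "expression", "gene"],
]
w = tf_idf(docs)
```

Note that "disease", which occurs in two of the three documents, receives a lower weight than "treatment", which is unique to one document: rarity across the collection signals discriminative importance.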
TF*IDF was first demonstrated by Salton in his System for Mechanical Analysis and Retrieval of Text (SMART) information retrieval system [39]. The incorporation of TF*IDF into the SMART system was the first demonstration of the potential utility of modeling techniques for representing concepts and their relative relationships to each other. In formal terminology, TF*IDF is an example of a "Vector Space Modeling" technique, which enables one to identify the closeness of two potentially related concepts in a mathematical (algebraic) space. For the SMART system, the concepts of interest were the documents themselves, with the goal of identifying documents potentially related to one another. This can be generalized to identify potential relationships between concepts based on their relative relationships to one another. Recent studies have also shown how genetic information can be incorporated within a vector space modeling approach to identify potential relatedness between concepts (e.g., between genetically related diseases [40] or between potential medicinal plants and therapeutic applications [41]).
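The "closeness in an algebraic space" mentioned above is typically measured with cosine similarity between term-weight vectors. The following sketch uses sparse vectors represented as dictionaries; the variable names and the example weights are invented for illustration.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors (dicts).

    Returns 1.0 for identical directions, 0.0 for vectors sharing no terms.
    """
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical TF*IDF weight vectors for three "concepts" (documents)
doc1 = {"gene": 0.8, "disease": 0.5}
doc2 = {"gene": 0.6, "treatment": 0.7}
doc3 = {"plant": 0.9, "therapy": 0.4}
```

Here doc1 is closer to doc2 (they share the term "gene") than to doc3 (no shared terms), which is exactly the intuition behind using the vector space to surface potentially related concepts.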
The development of modeling techniques for discerning potential relationships between concepts of interest continues to be an active area of research. Computational approaches are still needed, and must continue to be enhanced, to accommodate not only the heterogeneity of how data are represented across the plethora of potential knowledge sources, but also the volume of data being generated by a highly accelerated data generation process. These challenges contribute to those seen more generally in leveraging big data for the purposes of real-time knowledge generation (as further described in Chap. 7).
5.2.5 Plausibility of Discovered Knowledge: Evaluation
The mere development of approaches for the identification of potentially important concepts or relationships through modeling techniques is of little value if one cannot ascertain the value of the identified knowledge. Rigorous evaluation is thus essential for establishing trust in knowledge discovery systems. Evaluation of bibliome mining can be done in one of two ways: (1) ad hoc review by experts; or (2) comparison against a pre-defined gold standard. There are relative merits and challenges to each approach, but the general principle is that the results of an algorithm need to be quantified in some way so that one can ascertain the reliability of its predictions.
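For the gold-standard approach, the quantification usually takes the form of precision and recall over the set of predicted relationships. A minimal sketch, in which the helper function and the aspirin-related concept pairs are purely hypothetical examples:

```python
def precision_recall(predicted, gold_standard):
    """Quantify predicted relationships against a pre-defined gold standard.

    Precision: what fraction of predictions are in the gold standard?
    Recall: what fraction of the gold standard was predicted?
    """
    predicted, gold_standard = set(predicted), set(gold_standard)
    true_positives = predicted & gold_standard
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(gold_standard) if gold_standard else 0.0
    return precision, recall

predicted = {("aspirin", "headache"), ("aspirin", "fever"),
             ("aspirin", "cancer")}
gold = {("aspirin", "headache"), ("aspirin", "fever"),
        ("aspirin", "inflammation")}
p, r = precision_recall(predicted, gold)
```

In this toy case two of the three predictions appear in the gold standard (precision 2/3), and two of the three gold-standard pairs were recovered (recall 2/3); the trade-off between the two measures is often what distinguishes competing discovery algorithms.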
Since the overall principle behind developing computational approaches for identifying new knowledge is to reflect human intuition for discovering new knowledge, a common approach to evaluation involves the leveraging of