1.4.5.4 Validation of the Method
The annual TREC (Text REtrieval Conference) competition is a reference in
the area of automatic language processing. The methodology that has been
described was used in the “routing” task of the TREC-9 competition. The
routing task consists of ranking a large number of texts in order of
decreasing relevance, for each of a large number of topics. For the TREC-9
routing competition, two text corpora were used, relevant to 63 and 500
topics respectively, totaling 294,000 documents. Clearly, the task cannot
be accomplished manually or semiautomatically: a fully automated procedure
must be implemented. The above approach won the competition for both
corpora. Figure 1.40 shows the scores of the participants [Stricker 2001].
1.4.6 An Application in Bioengineering: Quantitative
Structure-Activity Relation Prediction for Organic
Molecules
The investigation of quantitative structure-activity relations (QSAR) of
molecules is a rapidly growing field thanks to progress in molecular
simulation. The objective of QSAR is the prediction of chemical properties
of molecules from structural data that can be computed ab initio, without
actually synthesizing the molecule; thus, costly organic syntheses, leading
to molecules that turn out not to have the desired property, can be avoided
[Hansch 1995]. That approach is especially useful in the field of
bioengineering, for the prediction of pharmacological properties of
molecules and for computer-aided drug discovery. It is also extremely
valuable for solving conceptually similar problems, such as the prediction
of properties of complex materials from their formulation, the prediction
of thermodynamic properties of mixtures, etc.
Why are neural networks useful in that context? If there exists a
deterministic relation between some features of the molecule and the
property that must be predicted, then QSAR can be cast as a regression
problem, i.e., the estimation of that unknown relation from examples. If
that relation is nonlinear, then neural networks can be advantageous, as
argued above.
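The regression argument can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the single "descriptor" x and the "property" y = x² are synthetic and stand for no real molecular data, and the small network, its size, and its training settings are arbitrary choices, not the models used in the work described here. The sketch fits a straight line and a one-hidden-layer network to the same examples and compares their mean squared errors.

```python
import math
import random


def fit_linear(xs, ys):
    """Closed-form least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def mse(preds, ys):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def train_mlp(xs, ys, hidden=4, lr=0.1, epochs=3000, seed=0):
    """Full-batch gradient descent on a network with one tanh hidden layer.

    Model: g(x) = sum_j v[j] * tanh(w[j] * x + b[j]) + c
    Cost:  half the mean squared error over the examples.
    """
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(hidden)]
    b = [rng.uniform(-1, 1) for _ in range(hidden)]
    v = [rng.uniform(-1, 1) for _ in range(hidden)]
    c = 0.0
    n = len(xs)
    for _ in range(epochs):
        gw, gb, gv, gc = [0.0] * hidden, [0.0] * hidden, [0.0] * hidden, 0.0
        for x, y in zip(xs, ys):
            h = [math.tanh(w[j] * x + b[j]) for j in range(hidden)]
            err = sum(v[j] * h[j] for j in range(hidden)) + c - y
            gc += err
            for j in range(hidden):
                gv[j] += err * h[j]
                back = err * v[j] * (1.0 - h[j] ** 2)  # tanh derivative
                gw[j] += back * x
                gb[j] += back
        for j in range(hidden):
            w[j] -= lr * gw[j] / n
            b[j] -= lr * gb[j] / n
            v[j] -= lr * gv[j] / n
        c -= lr * gc / n

    def predict(x):
        return sum(v[j] * math.tanh(w[j] * x + b[j]) for j in range(hidden)) + c

    return predict


# Synthetic examples (an assumption): 21 descriptor values in [-1, 1]
# and a property that depends on the descriptor nonlinearly.
xs = [i / 10 - 1.0 for i in range(21)]
ys = [x * x for x in xs]

slope, intercept = fit_linear(xs, ys)
linear_mse = mse([slope * x + intercept for x in xs], ys)

predict = train_mlp(xs, ys)
mlp_mse = mse([predict(x) for x in xs], ys)

print("linear model MSE:", linear_mse)
print("neural network MSE:", mlp_mse)
```

Because the synthetic relation is symmetric about zero, the best straight line is essentially flat and cannot capture any of the curvature, whereas the network can. Both errors are computed on the training examples only; a real QSAR study would estimate performance on data not used for training.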
A prerequisite for such an approach is the availability of databases for
training and testing the model. Because of the industrial importance of the
problem, many databases of existing molecules are available for such
properties as the boiling point, water solubility, or water-octanol
partition coefficients (known as “LogP”). The latter property is important
in pharmacology, because it gives a quantitative assessment of the ability
of the molecule to cross biological barriers in order to be active;
similarly, in environmental science, the value of LogP of pesticides
contributes to assessing their impact on the environment.
Once the availability of appropriate databases is guaranteed, the relevant
features that should be the inputs of the model must be determined. In the