the function used to estimate the numbers of true and false positives. This
improved function could be used to guide a new template search.
The best-first search described above may miss out on good templates be-
cause of its greedy decision making. This would be true even if the estimates of
the numbers of true and false positives were perfect, owing to the structure of
the graph: we can't guarantee that the best parents will have the best children.
One likely improvement would therefore be a population-based search, such as
a simple beam search or an evolutionary algorithm. Evolutionary algorithms
have been successfully used to solve a wide range of multi-objective optimi-
sation problems [10], including problems where evaluations must be limited
due to time or financial constraints [15]. Extracting information from a large
corpus takes considerable computing effort, so this is an aspect worth consid-
ering. Multi-objective evolutionary algorithms can efficiently generate a range
of Pareto-optimal solutions, and so explore the trade-off between the different
objectives. In our case, this means that for each number of true positives, we
find the template with the fewest false positives, and for each number of false
positives, we find the template with the most true positives. This produces a
range of solutions from which the user can then select whichever template or
templates are most suitable for their particular problem. This is more flexible
than the simple weighting suggested in Sect. 6.5.
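The Pareto-optimal selection described above can be illustrated with a minimal sketch. It assumes each candidate template has already been scored with a count of true and false positives. The `Scored` record, the template names and the counts are all hypothetical and serve only to show the dominance test, not the search that generates the candidates.

```python
from typing import List, NamedTuple


class Scored(NamedTuple):
    """A candidate template with its (estimated) counts. Hypothetical record."""
    template: str
    true_pos: int
    false_pos: int


def dominates(a: Scored, b: Scored) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    no_worse = a.true_pos >= b.true_pos and a.false_pos <= b.false_pos
    better = a.true_pos > b.true_pos or a.false_pos < b.false_pos
    return no_worse and better


def pareto_front(candidates: List[Scored]) -> List[Scored]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up scores for six illustrative templates.
candidates = [
    Scored("t1", 40, 12),
    Scored("t2", 40, 5),
    Scored("t3", 25, 1),
    Scored("t4", 25, 5),
    Scored("t5", 55, 30),
    Scored("t6", 10, 3),
]

front = pareto_front(candidates)
for c in front:
    print(c.template, c.true_pos, c.false_pos)
```

The surviving templates form the range of trade-offs from which a user would choose. This quadratic filter is only practical for small candidate sets; a multi-objective evolutionary algorithm would maintain such a front incrementally during the search.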
Finally, rather than starting with a seed fragment and a template consist-
ing solely of literals, we could start the search using a hand-written template.
This would not have to be optimised in advance, and in some cases, would
be easy to create. The search could then start from a point chosen to be use-
ful and optimised further through similar search processes to those outlined
above.
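As a rough illustration of how a simple beam search could be seeded with a starting template, consider the sketch below. It rests on strong simplifying assumptions: a template is a tuple of tokens, the only generalisation step replaces one literal with a wildcard `"*"`, and `toy_score` is an invented stand-in for the estimate of true and false positives. None of these names or choices come from the framework itself.

```python
import heapq
from typing import Callable, Iterable, Tuple

Template = Tuple[str, ...]


def beam_search(start: Template,
                successors: Callable[[Template], Iterable[Template]],
                score: Callable[[Template], float],
                beam_width: int = 2,
                max_steps: int = 10) -> Template:
    """Keep the beam_width best children at each step; return the best seen."""
    beam = [start]
    best = start
    seen = {start}
    for _ in range(max_steps):
        children = []
        for node in beam:
            for child in successors(node):
                if child not in seen:
                    seen.add(child)
                    children.append(child)
        if not children:
            break
        # Unlike greedy best-first search, several candidates survive each step.
        beam = heapq.nlargest(beam_width, children, key=score)
        if score(beam[0]) > score(best):
            best = beam[0]
    return best


def generalise(t: Template) -> Iterable[Template]:
    """Toy successor function: replace one literal with a wildcard."""
    for i, tok in enumerate(t):
        if tok != "*":
            yield t[:i] + ("*",) + t[i + 1:]


def toy_score(t: Template) -> float:
    """Invented score: reward generality, penalise losing the key verb."""
    return t.count("*") - (5 if t[2] == "*" else 0)


# A hand-written starting template (hypothetical example).
seed = ("the", "protein", "binds", "DNA")
best = beam_search(seed, generalise, toy_score)
print(best)
```

In a real setting the score would come from the estimated numbers of true and false positives, and the successor function from the template graph; only the control flow of the beam is meant to carry over.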
8 Conclusion
We have presented a formal framework to describe information extraction,
focusing on the definition of the template patterns used to convert free text
into a structured database. The framework has allowed us to explicitly iden-
tify some of the fundamental issues underlying information extraction and to
formulate possible solutions. We have shown that the framework allows com-
putationally feasible heuristic search methods to be developed for automatic
template creation, and that a practical implementation of the framework is
feasible. We also hope
that the framework will allow other researchers to gain further insights into
the theory and practice of information extraction and text mining.
Acknowledgements
This work is partly funded by the BBSRC grant BB/C507253/1, “Biological
Information Extraction for Genome and Superfamily Annotation.”