A Logical Framework for Template Creation and Information Extraction - Data Mining: Foundations and Practice


the function used to estimate the numbers of true and false positives. This improved function could be used to guide a new template search.

The best-first search described above may miss good templates because of its greedy decision making. This would be true even if the estimates of the numbers of true and false positives were perfect, owing to the structure of the graph: we cannot guarantee that the best parents will have the best children. One likely improvement would therefore be a population-based search, such as a simple beam search or an evolutionary algorithm. Evolutionary algorithms have been successfully used to solve a wide range of multi-objective optimisation problems [10], including problems where evaluations must be limited due to time or financial constraints [15]. Extracting information from a large corpus takes considerable computing effort, so this is an aspect worth considering. Multi-objective evolutionary algorithms can efficiently generate a range of Pareto-optimal solutions, and so explore the trade-off between the different objectives. In our case, this means that for each number of true positives, we find the template with the fewest false positives, and for each number of false positives, we find the template with the most true positives. This produces a range of solutions from which the user can then select whichever template or templates are most suitable for their particular problem. This is more flexible than the simple weighting suggested in Sect. 6.5.
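
The notion of a Pareto-optimal set of templates can be illustrated with a minimal sketch. It is not the search itself, only the final filtering step: given candidate templates with counts of true and false positives, it keeps those that no other candidate beats on both objectives. The `Candidate` type, the template names and all the counts are hypothetical and chosen purely for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """A template with hypothetical match counts (illustrative only)."""
    name: str
    true_pos: int
    false_pos: int


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.true_pos >= b.true_pos and a.false_pos <= b.false_pos
    better = a.true_pos > b.true_pos or a.false_pos < b.false_pos
    return no_worse and better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    front = [c for c in candidates
             if not any(dominates(other, c) for other in candidates)]
    # Sort by false positives so the trade-off reads from strict to permissive.
    return sorted(front, key=lambda c: (c.false_pos, -c.true_pos))


# Made-up counts; a real run would obtain them from the corpus.
candidates = [
    Candidate("t1", true_pos=10, false_pos=1),
    Candidate("t2", true_pos=10, false_pos=4),   # dominated by t1
    Candidate("t3", true_pos=25, false_pos=6),
    Candidate("t4", true_pos=20, false_pos=6),   # dominated by t3
    Candidate("t5", true_pos=40, false_pos=30),
]
front = pareto_front(candidates)
print([c.name for c in front])  # ['t1', 't3', 't5']
```

The three surviving templates span the trade-off from high precision (`t1`) to high coverage (`t5`), which is the kind of range a user would choose from.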

Finally, rather than starting with a seed fragment and a template consisting solely of literals, we could start the search using a hand-written template. This would not have to be optimised in advance and, in some cases, would be easy to create. The search could then start from a point chosen to be useful, and the template could be optimised further through search processes similar to those outlined above.
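
As a rough sketch of how a population-based search could be seeded with a hand-written template, the following toy beam search keeps the best few templates at each step rather than a single one. Everything here is an assumption made for illustration and not the chapter's template language: templates are tuples of tokens, `*` is a hypothetical wildcard matching any one token, the only generalisation step replaces a literal with a wildcard, and the score is a simple weighted difference of true and false positives on made-up sentences.

```python
WILDCARD = "*"  # hypothetical element matching any single token


def matches(template, tokens):
    """True if the template matches some contiguous window of tokens."""
    n = len(template)
    return any(
        all(t == WILDCARD or t == w for t, w in zip(template, tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )


def score(template, positives, negatives, fp_weight=2.0):
    """Illustrative objective: true positives minus weighted false positives."""
    tp = sum(matches(template, s) for s in positives)
    fp = sum(matches(template, s) for s in negatives)
    return tp - fp_weight * fp


def neighbours(template):
    """Generalise one literal at a time into a wildcard."""
    for i, t in enumerate(template):
        if t != WILDCARD:
            yield template[:i] + (WILDCARD,) + template[i + 1:]


def beam_search(seeds, positives, negatives, beam_width=3, max_steps=5):
    """Expand the beam_width best templates at each step; return the best seen."""
    def rank(t):
        # Higher score first; ties broken by the template for determinism.
        return (-score(t, positives, negatives), t)

    beam = sorted(set(seeds), key=rank)[:beam_width]
    best, seen = beam[0], set(beam)
    for _ in range(max_steps):
        children = {c for t in beam for c in neighbours(t)} - seen
        if not children:
            break
        seen |= children
        beam = sorted(children, key=rank)[:beam_width]
        if score(beam[0], positives, negatives) > score(best, positives, negatives):
            best = beam[0]
    return best


# Made-up labelled sentences and a hand-written starting template.
positives = [s.split() for s in ("A binds to B", "X binds to Y here", "Q binds to R")]
negatives = [s.split() for s in ("A talks to B", "X goes to Y")]
seed = ("A", "binds", "to", "B")

best = beam_search([seed], positives, negatives)
print(best)  # ('*', 'binds', 'to', '*')
```

Because `beam_search` takes a list of seeds, the same routine could start from several hand-written templates at once, or from literal-only templates as in the original search.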

8 Conclusion

We have presented a formal framework to describe information extraction, focusing on the definition of the template patterns used to convert free text into a structured database. The framework has allowed us to explicitly identify some of the fundamental issues underlying information extraction and to formulate possible solutions. We have shown that the framework allows computationally feasible heuristic search methods to be developed for automatic template creation. We have also shown that a practical implementation of this framework is feasible and allows automatic template creation. We also hope that the framework will allow other researchers to gain further insights into the theory and practice of information extraction and text mining.

Acknowledgements

This work is partly funded by the BBSRC grant BB/C507253/1, “Biological Information Extraction for Genome and Superfamily Annotation.”
