Frame-Based Approach for Reference Metadata Extraction - Technologies and Applications of Artificial Intelligence


so as to make people think that the rule-based approach is not suitable for NLP or AI in general. On the other hand, fine-grained linguistic knowledge cannot be easily captured by current machine learning models, which often results in less desirable recognition accuracy. Therefore, how to make the best of both rule-based and statistical approaches has always been a challenging task. In light of this, we propose a novel frame-based approach (FBA) and use reference metadata extraction (RME) as a case study to demonstrate its advantages.

The task of RME is to automatically extract the metadata of input reference or citation strings¹, where metadata is defined as a set of structured data, such as the author, title, etc. However, automatic RME often struggles with the variations between field separators. For example, the author and title fields can be separated by spaces or periods, while the volume and issue fields can be separated by braces or parentheses [1]. Moreover, RME is a punctuation-sensitive task, since missing punctuation between or within fields often causes ambiguity. Even within each field, there can be punctuation and spacing differences. To further complicate the problem, there are many drastically different citation styles (i.e., different field orders).
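The separator problem can be illustrated with a minimal sketch. The rule, the function name, and both reference strings below are invented for illustration and are not taken from the paper or its datasets; the point is only that a rigid rule written for one separator style silently fails on another.

```python
import re

# Hypothetical rigid rule: expects the issue number in parentheses, e.g. "12(3)".
VOLUME_ISSUE = re.compile(r"(?P<volume>\d+)\((?P<issue>\d+)\)")


def extract_volume_issue(reference: str):
    """Return (volume, issue) if the rigid rule matches, otherwise None."""
    match = VOLUME_ISSUE.search(reference)
    if match is None:
        return None
    return match.group("volume"), match.group("issue")


# Two invented reference strings that differ only in the volume/issue separator.
paren_style = "Doe, J. A sample title. Journal of Examples 12(3), 45-67 (2001)"
brace_style = "Doe, J. A sample title. Journal of Examples 12{3}, 45-67 (2001)"

print(extract_volume_issue(paren_style))  # ('12', '3')
print(extract_volume_issue(brace_style))  # None
```

Covering the second style requires either another hand-written rule or a more tolerant matching scheme, which is the kind of inflexibility the text describes.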

The main contributions of this research are three-fold. First, the new frame matching algorithm, based on sequence alignment, can compensate for the shortcomings of the traditional rule-based approach, in which rule matching lacks flexibility and generality. Second, approximate matching is adopted to capture reasonable abbreviations or errors in the input reference string, further increasing the coverage of the frames. Third, experiments conducted on extensive datasets show that the same knowledge framework performs equally well on various untrained domains.
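To make the first two contributions concrete, here is a minimal sketch of how sequence alignment and approximate matching can be combined. This excerpt does not give the details of the frame matching algorithm, so the code is not the authors' method: it is a generic Needleman-Wunsch global alignment score, with Python's `difflib.SequenceMatcher` standing in for the approximate matcher. The function names, scoring weights, threshold, and example tokens are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive approximate similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_score(frame, tokens, gap=-1.0, mismatch=-1.0, threshold=0.5):
    """Global alignment score between a frame's slots and the input tokens.

    Gaps let the input skip a slot or contain an extra token; pairs whose
    similarity reaches the (assumed) threshold count as approximate matches.
    """
    rows, cols = len(frame) + 1, len(tokens) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sim = similarity(frame[i - 1], tokens[j - 1])
            pair = 2.0 * sim if sim >= threshold else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align slot with token
                score[i - 1][j] + gap,       # slot has no token
                score[i][j - 1] + gap,       # token has no slot
            )
    return score[-1][-1]


# Hypothetical frame and inputs, invented for illustration.
frame = ["proceedings", "of", "conference"]
abbreviated = ["proc", "of", "conference"]  # abbreviation of the first slot
missing = ["proceedings", "conference"]     # one token absent
unrelated = ["vol", "12", "2001"]

for candidate in (frame, abbreviated, missing, unrelated):
    print(candidate, align_score(frame, candidate))
```

Under these assumed weights, the abbreviated and incomplete inputs still score well above the unrelated one, whereas an exact, position-by-position rule would reject them. That tolerance is what the text attributes to alignment-based, approximate matching.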

2 Related Work

Previous work can roughly be divided into three categories: machine learning based, template-based, and knowledge-based (or rule-based) methods. The machine learning based approach, which casts RME as a sequential labeling or classification problem at the token level, takes advantage of probabilistic estimation based on training sets of tagged bibliographical data. [11] use the Hidden Markov Model to extract important fields from the headers of computer science research papers. [8] apply classifiers such as the support vector machine (SVM) to this task. [10] employ conditional random fields (CRFs) to extract various common fields from the headers and citations of research papers. The template-based approach includes the BibPro system [2] and those developed by [3] and [6], which use template mining for citation extraction. The BibPro system transforms citation strings into protein-like sequences and uses the Basic Local Alignment Search Tool (BLAST) to find the highest similarity score for predicting the fields of a citation string. [6] use three templates for extracting information from citations. The advantage of such models is their efficiency. Thirdly, the knowledge-based methods include

¹ The two terms “reference” and “citation” are used interchangeably in this paper.
