so as to make people think that the rule-based approach is not suitable for NLP or
AI in general. On the other hand, fine-grained linguistic knowledge cannot be
easily captured by current machine learning models, which often results in less
desirable recognition accuracy. Therefore, how to make the best of rule-based
and statistical approaches has always been a challenging task. In light of this,
we propose a novel frame-based approach (FBA) and use reference metadata
extraction (RME) as a case study to demonstrate its advantages.
The task of RME is to automatically extract the metadata of input reference
or citation strings¹, where metadata is defined as a set of structured data, such
as the author, title, etc. However, automatic RME often struggles with variations
in field separators. For example, the author and title fields can be separated by
spaces or periods, while the volume and issue fields can be separated by braces
or parentheses [1]. Moreover, RME is a punctuation-sensitive task, since missing
punctuation between or within fields often causes ambiguity. Even within each
field, there can be punctuation and spacing differences. To further complicate
the problem, there are many drastically different citation styles (i.e., different
field orders).
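To make the separator problem concrete, the following minimal sketch encodes one rigid extraction rule as a regular expression. It is only an illustration and is not part of the proposed approach: the rule, the field names, and the reference strings are all hypothetical. The rule assumes that fields are separated by periods and that the issue number is enclosed in parentheses, so it fails as soon as either separator changes.

```python
import re

# Hypothetical rigid rule: author, title and journal are separated by
# periods and a comma, and the issue number is written in parentheses.
RIGID_RULE = re.compile(
    r"^(?P<author>[^.]+)\. (?P<title>[^.]+)\. (?P<journal>[^,]+), "
    r"(?P<volume>\d+)\((?P<issue>\d+)\)$"
)


def extract(reference: str):
    """Return a metadata dict if the rigid rule matches, otherwise None."""
    match = RIGID_RULE.match(reference)
    return match.groupdict() if match else None


# Made-up reference strings that differ only in their separators.
expected_style = "Smith J. Parsing citations. Journal of Examples, 12(3)"
brace_issue = "Smith J. Parsing citations. Journal of Examples, 12{3}"
space_separator = "Smith J Parsing citations. Journal of Examples, 12(3)"

for ref in (expected_style, brace_issue, space_separator):
    print(ref, "->", extract(ref))
```

Only the first string is parsed; the other two are rejected outright even though a human reader would recognize the same fields in them.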
The main contributions of this research are three-fold. First, the new frame
matching algorithm, based on sequence alignment, can compensate for the
shortcomings of the traditional rule-based approach, in which rule matching lacks
flexibility and generality. Second, approximate matching is adopted to capture
reasonable abbreviations or errors in the input reference string, which further
increases the coverage of the frames. Third, experiments conducted on extensive
datasets show that the same knowledge framework performs equally well on
various untrained domains.
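The idea of combining sequence alignment with approximate matching can be sketched as follows. This is not the frame matching algorithm of this paper, whose details are not given in this section; it is a generic Needleman-Wunsch style global alignment in which exact token equality is replaced by a string similarity score from Python's difflib. The function names, the gap penalty, the mismatch penalty, and the similarity threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(slot: str, token: str) -> float:
    """Approximate match score in [0, 1], tolerant of abbreviations and typos."""
    cleaned = token.lower().rstrip(".")  # drop the period of an abbreviation
    return SequenceMatcher(None, slot.lower(), cleaned).ratio()


def align_score(frame, tokens, gap=-0.5, mismatch=-1.0, threshold=0.45):
    """Global alignment score between frame slots and input tokens.

    Pairs whose similarity reaches `threshold` contribute that similarity;
    other pairs contribute `mismatch`; skipping a slot or a token costs `gap`.
    All three values are illustrative assumptions.
    """
    rows, cols = len(frame) + 1, len(tokens) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sim = similarity(frame[i - 1], tokens[j - 1])
            pair = sim if sim >= threshold else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align slot with token
                score[i - 1][j] + gap,       # skip a frame slot
                score[i][j - 1] + gap,       # skip an input token
            )
    return score[-1][-1]


# Hypothetical frame and inputs.
frame = ["proceedings", "of", "the", "conference"]
print(align_score(frame, ["proceedings", "of", "the", "conference"]))
print(align_score(frame, ["proc.", "of", "the", "conf."]))
print(align_score(frame, ["1999", "12", "34", "56"]))
```

Under these assumptions an abbreviated input still aligns with the frame, only with a lower score than the exact wording, whereas an exact-match rule would reject it.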
2 Related Work
Previous work can roughly be divided into three categories: machine learning
based, template-based, and knowledge-based (or rule-based) methods. The machine
learning based approach, which casts RME as a sequential labeling or
classification problem at the token level, takes advantage of probabilistic
estimation based on training sets of tagged bibliographical data. [11] use a
hidden Markov model (HMM) to extract important fields from the headers of
computer science research papers. [8] apply classifiers such as the support
vector machine (SVM) to this task. [10] employ conditional random fields (CRFs)
to extract various common fields from the headers and citations of research
papers. The template-based approach includes the BibPro system [2] and the
systems developed by [3] and [6], which use template mining for citation
extraction. The BibPro system transforms citation strings into protein-like
sequences and uses the Basic Local Alignment Search Tool (BLAST) to find the
highest similarity score for predicting the fields of a citation string. [6]
use three templates for extracting information from citations. The advantage
of such models is their efficiency. Thirdly, the knowledge-based methods include
1 The two terms “reference” and “citation” will be used interchangeably in this paper.
 