A Frame-Based Approach for Reference Metadata Extraction
Yu-Lun Hsieh 1 , Shih-Hung Liu 1 , Ting-Hao Yang 1 , Yu-Hsuan Chen 1 ,
Yung-Chun Chang 1 , Gladys Hsieh 1 , Cheng-Wei Shih 1 , Chun-Hung Lu 2 ,
and Wen-Lian Hsu 1
1 Institute of Information Science, Academia Sinica, Taipei, Taiwan
{ morphe,journey,tinghaoyang,smallright,changyc,gladys,
dapi,hsu } @iis.sinica.edu.tw
2 Innovative Digitech-Enabled Applications & Services Institute, III, Taiwan
enricoghlu@iii.org.tw
Abstract. In this paper, we propose a novel frame-based approach
(FBA) and use reference metadata extraction as a case study to demonstrate
its advantages. The main contributions of this research are
three-fold. First, the new frame matching algorithm, based on sequence
alignment, compensates for the shortcomings of traditional rule-based
approaches, in which rule matching lacks flexibility and generality. Second,
approximate matching is adopted to capture reasonable abbreviations
or errors in the input reference string, further increasing the coverage
of the frames. Third, experiments conducted on extensive datasets
show that the same knowledge framework performed equally well on various
untrained domains. Compared with a widely used machine learning
method, Conditional Random Fields (CRFs), the FBA reduces the average
field error rate across all four independent test sets by about 70% in
relative terms (2.24% vs. 7.54%).
Keywords: Reference Metadata Extraction, Knowledge representation,
Frame-based approach.
1 Introduction
In natural language processing (NLP), an important task is to recognize various
linguistic expressions. Many such expressions can be represented as rules or
templates, which a computer matches against text to identify the corresponding
linguistic objects. In practice, however, there always seem to be many
exceptions or variations that the rules or templates do not cover. A typical way
to cope with this situation is either to produce more templates or to relax the
constraints of the existing ones (e.g., by inserting optional elements or
wildcards). The former produces many case-by-case templates that may conflict
with one another; the latter can lead to many false positives, namely,
expressions that are matched but undesirable. Thus, the inflexibility of
rule-based systems has
troubled the NLP as well as the artificial intelligence (AI) communities for years
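
The trade-off between rigid and relaxed templates can be illustrated with a small sketch. The reference strings and the two regular expressions below are hypothetical, made up for this illustration only; they are not the templates or frames used in this paper, and the sketch does not implement the FBA itself.

```python
import re

# Hypothetical strict template: "Surname, I.: Title. Venue (Year)"
STRICT = re.compile(
    r"^(?P<author>[A-Z][a-z]+, [A-Z]\.): (?P<title>[^.]+)\. "
    r"(?P<venue>[^()]+) \((?P<year>\d{4})\)$"
)

# Hypothetical relaxed template: same separators, but every field is a wildcard
RELAXED = re.compile(
    r"^(?P<author>.+?): (?P<title>.+?)\. (?P<venue>.+?) \((?P<year>\d{4})\)$"
)

# Made-up inputs for illustration
good = "Smith, J.: Frame matching. LNCS (2012)"        # fits the strict template
variant = "Smith, John: Frame matching. LNCS (2012)"   # same reference, full first name
noise = "Note: see Sect. 3 for details (2012)"         # not a reference at all

for name, text in [("good", good), ("variant", variant), ("noise", noise)]:
    strict_hit = STRICT.match(text) is not None
    relaxed_hit = RELAXED.match(text) is not None
    print(f"{name:8s} strict={strict_hit} relaxed={relaxed_hit}")
```

In this sketch the strict template misses the harmless variation (a spelled-out first name), while the relaxed template accepts both the variation and a string that is not a reference at all, which is the false-positive problem described above.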