The resources (dictionaries) used in FBA were collected from WordNet 3, DBLP 4, and the Internet. We parsed the individual words of author names, titles, and journal names in DBLP to form part of the author, title, and journal dictionaries. A family-name list containing 71,475 entries was collected from the Internet 5. We then aggregated these resources into the corresponding dictionaries for the author, title, and journal fields, which contain 308,157, 73,683, and 32,305 words, respectively.
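As a rough illustration, the DBLP-derived part of these dictionaries could be built along the following lines; the file name, streaming strategy, and output paths are our own assumptions (in particular, the sketch assumes the character entities of the XML dump have already been expanded), not the actual tooling used.

import xml.etree.ElementTree as ET

FIELDS = {"author": set(), "title": set(), "journal": set()}

# Stream the large dblp.xml dump rather than loading it into memory at once.
for _, elem in ET.iterparse("dblp.xml", events=("end",)):
    if elem.tag in FIELDS and elem.text:
        # Add every individual word of the field text to its dictionary.
        for word in elem.text.split():
            FIELDS[elem.tag].add(word.lower())
    elem.clear()  # discard processed elements as we go

for name, words in FIELDS.items():
    with open(name + "_dict.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(words)))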
The baseline system was built with conditional random fields (CRFs) 6, which are widely used and well practiced for sequence labeling problems in various natural language applications. The same resources were used to train both FBA and CRFs. For the training of CRFs, we adopted the features in [10], except for their external resources, which we could not obtain.
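For concreteness, CRF++ accepts training data as a token-per-line column file (feature columns separated by whitespace, the label last, references separated by blank lines); the sketch below writes one such reference. The example reference, labels, and the single dictionary feature are illustrative stand-ins, not the actual feature set of [10].

# Hypothetical sketch: write one labeled reference in CRF++ column format.
tokens = ["J.", "Smith", ",", "Deep", "Parsing", ",", "JMLR", ",", "2010", "."]
labels = ["AUTHOR", "AUTHOR", "O", "TITLE", "TITLE", "O", "JOURNAL", "O", "YEAR", "O"]

author_dict = {"smith"}  # stand-in for the full author dictionary

def in_dict(token, dictionary):
    # Binary lexicon feature: is the token found in the dictionary?
    return "Y" if token.lower() in dictionary else "N"

with open("train.crf", "w", encoding="utf-8") as f:
    for tok, lab in zip(tokens, labels):
        f.write(tok + "\t" + in_dict(tok, author_dict) + "\t" + lab + "\n")
    f.write("\n")  # blank line marks the reference boundary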
In contrast to previous RME evaluations, which used word and/or field accuracy, we evaluate performance using the field error rate (FER), defined in (1). The reason is that most accuracies are close to 100%, so the reduction in error rate more faithfully reflects the improvement.
\mathrm{FER} = \left(1 - \frac{\text{Number of correctly extracted fields}}{\text{Total number of fields}}\right) \times 100\% \qquad (1)
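For clarity, the metric can be computed as in the short sketch below; the counts are illustrative only.

def field_error_rate(correct_fields, total_fields):
    # FER = (1 - correct / total) * 100%, as in Eq. (1).
    return (1 - correct_fields / total_fields) * 100.0

print(field_error_rate(979, 1000))  # ~2.1 (percent)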
4.2 Experiment Results
At the outset, we evaluated the training-set performance (regarded as the inside test) of FBA and CRFs. As shown in Table 2, the overall field error rates of FBA and CRFs are 2.05% and 7.15%, respectively. CRFs performs well on the literal part and on some of the number part: FBA is slightly better than CRFs on the number part, while the situation is reversed on the literal part.
For the EndNote test set performance, depicted in Table 3, FBA is better than CRFs for all fields, with an overall field error rate of 2.29%, a 5.22% improvement over CRFs. Since this test set has the same journal styles as the training set, both FBA and CRFs achieve stable results. Comparing the CRFs results on the training and EndNote sets, the performance on the number part is lower. A reasonable explanation is that some number parts of a reference contain no whitespace, whereas CRFs tokenizes the input by whitespace; this mismatch can lead to prediction errors and would require some post-processing to recover even partial results in the number part. If one designed a more sophisticated tokenization strategy unaffected by punctuation, the performance of CRFs could be further enhanced.
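As a sketch only, a punctuation-aware tokenizer of the kind suggested above could split digit runs and punctuation marks into separate tokens; the regular expression and example string are our own illustrative choices.

import re

def tokenize(reference):
    # Emit digit runs, word tokens, and individual punctuation marks as separate tokens.
    return re.findall(r"\d+|\w+|[^\w\s]", reference)

print(tokenize("IEEE Trans. Knowl. Data Eng. 12(3):45-67, 2000"))
# ['IEEE', 'Trans', '.', 'Knowl', '.', 'Data', 'Eng', '.', '12',
#  '(', '3', ')', ':', '45', '-', '67', ',', '2000']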
Next, we compared the performance of the baseline CRFs and that reported by [2] on the BibPro set, which contains only six of the 5,076 generated journal styles. Table 4 indicates that FBA has a stable performance
3 http://wordnet.princeton.edu
4 http://dblp.uni-trier.de/xml
5 http://www.last-names.net, http://genforum.genealogy.com/surnames
6 http://crfpp.googlecode.com