rather than exact matching (substitution). A probable substitution is considered
helpful in matching and given a positive score that is lower than that of an exact
match. After all these scores are determined, the score of the matched frame is
obtained.
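The scoring idea above can be sketched as a small global alignment between a frame and a sequence of token labels. This is only an illustration, not the authors' implementation: the function name `align_score`, the `similar` set of probable substitutions, the label `S` (for a supplement number), and all numeric score values are assumptions. The one property taken from the text is that a probable substitution earns a positive score lower than that of an exact match.

```python
def align_score(frame, tokens, similar, match=2, subst=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style score of aligning `tokens` to `frame`.

    All score values are assumed for illustration. `similar` holds
    (frame_slot, token_label) pairs treated as probable substitutions.
    """
    n, m = len(frame), len(tokens)
    # dp[i][j] = best score aligning frame[:i] with tokens[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap  # frame slots with no token (deletions)
    for j in range(1, m + 1):
        dp[0][j] = j * gap  # tokens with no frame slot (insertions)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            slot, label = frame[i - 1], tokens[j - 1]
            if slot == label:
                pair = match      # exact match
            elif (slot, label) in similar:
                pair = subst      # probable substitution: positive but lower
            else:
                pair = mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + pair,
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    return dp[n][m]


frame = ["V", "I", "P"]
similar = {("I", "S")}  # hypothetical: a supplement number may stand in for an issue
exact = align_score(frame, ["V", "I", "P"], similar)
substituted = align_score(frame, ["V", "S", "P"], similar)
missing_issue = align_score(frame, ["V", "P"], similar)
```

Under these assumed values the exact sequence scores highest, the substituted one slightly lower, and the sequence with a missing issue lower still, so candidate frames could be ranked by score.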
To recap, consider the cases shown in Table 1, in which author names appear in
different styles and even contain errors such as two consecutive commas. A
rule-based approach like [5] must enumerate all possible combinations of
insertions or deletions, making it hard to handle unseen cases or unexpected
errors. FBA, however, can tackle such variations because it captures multiple
forms of the same reference concept in one compact frame (e.g. F : M : L). Since
one frame can do the work of multiple rules, the number of frames can be very
small, which is indeed the case in Table 1. Although the precision of FBA might
be slightly sacrificed, the recall is much higher. Consequently, human labor is
drastically reduced in the FBA approach.
Table 1. Comparison of rule- and frame-based method for various number part and
one author in literal part of reference string (including errors)

Rule-based:                   | Frame-based:      | Rule-based:            | Frame-based:
One rule for each case        | One frame to      | One rule for each case | Two frames to
(to list just a few)          | cover all cases   | (to list just a few)   | cover all cases
------------------------------+-------------------+------------------------+----------------
Vol. 38, no. 2, pp. 115-126   | V : I : P         | K.C. Wang              | F : M : L
Volume 38, suppl 2, p. 11-126 |                   | Yue-Kuen Kwok          | L : F : M
38, 2, 115-126                |                   | H.-J. Li               |
38(2):115-126                 |                   | Yen, Jerome            |
38(2:115-126                  |                   | Burghard von Karger    |
38.2, 115-126                 |                   | Tung X. Bui            |
38 115-126                    |                   | Bui, Tung X.           |
38,, 2,,, 115-126             |                   | Li,, H.-J.             |
38:2:115-126                  |                   | Chen.. S. L.           |
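To make the contrast in Table 1 concrete, the sketch below fills a single V : I : P frame from every number-part variant listed in the table, including the malformed ones. It is a deliberately simplified stand-in that uses regular expressions rather than the alignment-based matching of FBA. The function name `match_vip` and the heuristic itself (take the first page range as P, then read the numbers before it as V and an optional I) are assumptions made for illustration.

```python
import re


def match_vip(number_part):
    """Fill a V : I : P frame from a number-part string (illustrative heuristic)."""
    # P: the first "digits-digits" run is taken to be the page range.
    pages = re.search(r"(\d+)\s*-\s*(\d+)", number_part)
    if pages is None:
        return None
    # V and optional I: the numbers before the page range; punctuation and
    # words such as "Vol." or "no." are simply skipped.
    head = re.findall(r"\d+", number_part[:pages.start()])
    if not 1 <= len(head) <= 2:
        return None
    return {
        "V": head[0],
        "I": head[1] if len(head) == 2 else None,
        "P": f"{pages.group(1)}-{pages.group(2)}",
    }


cases = [
    "Vol. 38, no. 2, pp. 115-126",
    "38(2):115-126",
    "38(2:115-126",       # unbalanced parenthesis
    "38,, 2,,, 115-126",  # repeated commas
    "38 115-126",         # issue number missing
]
results = [match_vip(c) for c in cases]
```

One function handles both well-formed and erroneous strings, whereas a rule-based system would need a separate pattern for each surface form.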
4 Experiments
4.1 Experiment Setup
First, 30,000 reference records were retrieved from publicly available digital
libraries on the web, and 5,076 bibliographic (journal) styles were collected
from EndNote (footnote 2). The reference strings were generated from each of
those 5,076 journal styles. We then randomly selected 10,000 and 20,000 strings
for training (denoted as the TrainingSet) and testing (denoted as the
EndNoteSet), respectively. In addition, we used the BibPro set [2], consisting
of 10,000 reference strings in six journal styles, and the FluxHS set [4],
containing 2,000 journal reference strings in the health science domain.
Finally, we randomly collected 1,500 journal reference strings from multiple
researchers' websites to form the “free style set”. Hence, there are one
training set and four test sets for FBA and CRFs.
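The random selection of training and test strings could be reproduced along the following lines. This is a minimal sketch under stated assumptions: the text does not say how the sampling was done, so the function name `split_pool`, the fixed seed, the placeholder pool, and the choice to keep the two sets disjoint are all hypothetical.

```python
import random


def split_pool(pool, n_train, n_test, seed=0):
    """Draw disjoint training and test samples from `pool` (assumed procedure)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    picked = rng.sample(pool, n_train + n_test)  # sampling without replacement
    return picked[:n_train], picked[n_train:]


# Placeholder strings standing in for the generated reference strings.
pool = [f"reference string {i}" for i in range(30_000)]
training_set, endnote_set = split_pool(pool, 10_000, 20_000)
```

Sampling once without replacement and then slicing guarantees that no string appears in both the TrainingSet and the EndNoteSet.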
2 http://endnote.com/downloads/styles
 