Frame-Based Approach for Reference Metadata Extraction - Technologies and Applications of Artificial Intelligence - page 158

rather than exact matching (substitution). A probable substitution is considered helpful in matching and given a positive score that is lower than that of an exact match. After all these scores are determined, the score of the matched frame is obtained.
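The scoring idea can be sketched as a small global-alignment routine over frame slots and token labels. This is a minimal illustration under stated assumptions, not the authors' implementation: the numeric score values, the token label `INITIAL`, and the `PROBABLE_PAIRS` table are hypothetical. The text only states that a probable substitution earns a positive score lower than that of an exact match.

```python
# Illustrative score values (assumptions; the source does not give numbers).
EXACT = 2      # slot label equals token label
PROBABLE = 1   # probable substitution: positive, but lower than EXACT
MISMATCH = -1  # implausible substitution
GAP = -1       # insertion or deletion

# Hypothetical table of (frame slot, token label) pairs that count as
# probable substitutions, e.g. an initial standing in for a first name.
PROBABLE_PAIRS = {("F", "INITIAL"), ("M", "INITIAL")}


def pair_score(slot, label):
    """Score for aligning one frame slot with one token label."""
    if slot == label:
        return EXACT
    if (slot, label) in PROBABLE_PAIRS:
        return PROBABLE
    return MISMATCH


def frame_score(frame, labels):
    """Best global-alignment score between frame slots and token labels."""
    rows, cols = len(frame) + 1, len(labels) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * GAP
    for j in range(1, cols):
        dp[0][j] = j * GAP
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair_score(frame[i - 1], labels[j - 1]),
                dp[i - 1][j] + GAP,  # frame slot absent from the input
                dp[i][j - 1] + GAP,  # extra token in the input
            )
    return dp[-1][-1]


print(frame_score(["F", "M", "L"], ["F", "M", "L"]))        # 6
print(frame_score(["F", "M", "L"], ["F", "INITIAL", "L"]))  # 5
print(frame_score(["F", "M", "L"], ["F", "L"]))             # 3
```

With these assumed values, an input that differs from the frame only by a probable substitution or a missing slot still scores well, so the frame can be selected without a dedicated rule for each variant.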

To recap, consider the cases shown in Table 1, in which author names appear in different styles and even contain errors such as two consecutive commas. A rule-based approach like [5] must enumerate all possible combinations of insertions or deletions, making it hard to handle unseen cases or unexpected errors. FBA, however, can tackle such variations because it captures multiple forms of the same reference concept in one compact frame (e.g. F : M : L). Since a single frame can cover what would otherwise require multiple rules, the number of frames can be very small, as Table 1 shows. Though the precision of FBA might be slightly sacrificed, the recall is much higher. Consequently, human labor is drastically reduced in the FBA approach.
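To make the contrast with per-case rules concrete, the sketch below fills a single V : I : P (volume, issue, pages) frame from differently punctuated number parts. It is a deliberately simplified stand-in that ignores delimiters altogether, not the alignment-based matching of FBA; the function name `fill_vip_frame` and both regular expressions are assumptions made for illustration.

```python
import re

# Assumed patterns: a page range such as "115-126", and any run of digits.
PAGE_RANGE = re.compile(r"(\d+)\s*-\s*(\d+)")
NUMBER = re.compile(r"\d+")


def fill_vip_frame(text):
    """Fill a V : I : P frame from a loosely formatted number part.

    Returns a dict with keys "V", "I" and "P" (the issue may be None),
    or None when no page range or no volume number can be found.
    """
    pages = PAGE_RANGE.search(text)
    if pages is None:
        return None
    # Numbers before the page range are read as volume, then issue.
    head = NUMBER.findall(text[:pages.start()])
    if not head:
        return None
    return {
        "V": head[0],
        "I": head[1] if len(head) > 1 else None,
        "P": f"{pages.group(1)}-{pages.group(2)}",
    }


for s in ["Vol. 38, no. 2, pp. 115-126", "38(2:115-126",
          "38,, 2,,, 115-126", "38 115-126"]:
    print(s, "->", fill_vip_frame(s))
```

The malformed inputs "38(2:115-126" and "38,, 2,,, 115-126" fill the same slots as the well-formed string, whereas a rule keyed to exact delimiters would need one pattern per variant.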

Table 1. Comparison of rule-based and frame-based methods for variations of the number part and of one author in the literal part of a reference string (including errors)

Rule-based:                     Frame-based:      Rule-based:              Frame-based:
One rule for each case          One frame to      One rule for each case   Two frames to
(to list just a few)            cover all cases   (to list just a few)     cover all cases
------------------------------------------------------------------------------------------
Vol. 38, no. 2, pp. 115-126                       K.C. Wang
Volume 38, suppl 2, p. 11-126                     Yue-Kuen Kwok
38, 2, 115-126                                    H.-J. Li
38(2):115-126                                     Yen, Jerome
38(2:115-126                    V : I : P         Burghard von Karger      F : M : L
38.2, 115-126                                     Tung X. Bui              L : F : M
38 115-126                                        Bui, Tung X.
38,, 2,,, 115-126                                 Li,, H.-J.
38:2:115-126                                      Chen.. S. L.

4 Experiments

4.1 Experiment Setup

First, 30,000 reference records were retrieved from publicly available digital libraries on the web, and 5,076 bibliographic (journal) styles were collected from EndNote (footnote 2). The reference strings were generated from each of those 5,076 journal styles. We then randomly selected 10,000 and 20,000 strings for training (denoted as the TrainingSet) and testing (denoted as the EndNoteSet), respectively. In addition, we used the BibPro set [2], consisting of 10,000 reference strings in six journal styles, and the FluxHS set [4], containing 2,000 journal reference strings in the health science domain. Finally, we randomly collected 1,500 journal reference strings from multiple researchers' websites to form the "free style set". Hence, there are one training set and four test sets for FBA and CRFs.
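The random 10,000 / 20,000 split can be reproduced along the following lines. This is a sketch only: the function name `split_references`, the fixed seed, and the placeholder strings are assumptions, since the text does not say how the selection was implemented.

```python
import random


def split_references(strings, n_train, n_test, seed=0):
    """Randomly split reference strings into disjoint training and test sets."""
    if n_train + n_test > len(strings):
        raise ValueError("not enough strings for the requested split")
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    shuffled = list(strings)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]


# Placeholder data standing in for the 30,000 generated reference strings.
references = [f"reference string {i}" for i in range(30000)]
training_set, endnote_set = split_references(references, 10000, 20000)
print(len(training_set), len(endnote_set))  # 10000 20000
```

Shuffling once and slicing keeps the two sets disjoint, so no test string is seen during training.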

2 http://endnote.com/downloads/styles
