Information Technology Reference
In-Depth Information
ReferenceFrame
  SLOT
    1: Literal Part
      SLOT
        1: Authors
          SLOT
            1: First Name
            2: Middle Name
            3: Last Name
          PATTERN
            [1]:[2]:[3]
        2: Title
        3: Journal
      PATTERN
        [1]:[2]:[3]
        [2]:[1]:[3]
    2: Number Part
      SLOT
        4: Volume
          SLOT
            1: Volume Prefix
              [Volume]
              [Vol]
            2: Digits
              RegExp
                \d+
          PATTERN
            [1]:[2]
        5: Issue
          SLOT
            1: Issue Prefix
              *Supplement*
              [No]
            2: Digits
              RegExp
                \d+
          PATTERN
            [1]:[2]
        6: Page
        7: Year
      PATTERN
        [4]:[5]:[6]:[7]
  PATTERN
    [1]:[2]
Fig. 1. An illustration of the frame-slot representation of the RME domain knowledge.
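As a rough illustration of how a leaf frame in Fig. 1 might be put to work, the sketch below encodes the Volume frame (a prefix slot, a digits slot, and the pattern [1]:[2]) and compiles it into a regular expression. The names VOLUME and frame_to_regex are hypothetical, and several details are assumptions not stated in the figure: the ":" in a pattern is read as "followed by", slots are separated by optional whitespace, and the prefix may carry a trailing period.

```python
import re

# Hypothetical encoding of the Volume frame from Fig. 1.
# Slot 1 lists the prefix alternatives; slot 2 is the RegExp \d+.
# The optional trailing period on the prefix is an assumption.
VOLUME = {
    "slots": {1: r"(?:Volume|Vol)\.?", 2: r"\d+"},
    "patterns": [[1, 2]],  # PATTERN [1]:[2]
}

def frame_to_regex(frame, sep=r"\s*"):
    """Compile a frame into one regex.

    Each pattern is a sequence of slot numbers; alternative patterns
    are joined with '|'. The separator between slots is an assumption.
    """
    alternatives = []
    for pattern in frame["patterns"]:
        parts = [f"(?:{frame['slots'][i]})" for i in pattern]
        alternatives.append(sep.join(parts))
    return re.compile("|".join(f"(?:{alt})" for alt in alternatives))

volume_re = frame_to_regex(VOLUME)
print(bool(volume_re.fullmatch("Vol. 12")))   # prefix followed by digits
print(bool(volume_re.fullmatch("12 Vol")))    # wrong slot order
```

A frame with several patterns, such as the Literal Part with [1]:[2]:[3] and [2]:[1]:[3], would fit the same structure by listing both slot orders under "patterns".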
pre-collected dictionaries are used to tag author, title, and journal tokens as
A, T, and J, respectively. A reference string is first tokenized by whitespace,
and the dictionaries are then used to assign one or more tags to each token.
Subsequently, frequent tag trigrams are examined to generate frames such as
“AAT”, “TTA”, and “TTJ”. In addition, 40% of the titles in the training data are
enclosed in quotation marks, so quotation marks are used to mark the boundary
between T and J. Furthermore, according to previous analysis [5], over 60% of
year fields occur between A and T. Thus, the year is also included as a boundary
indicator.
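The tagging and trigram steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionaries are toy stand-ins for the pre-collected ones, the function names are hypothetical, and stripping punctuation and lowercasing before lookup is an assumption. A token that appears in several dictionaries receives several tags, and every tag combination in a three-token window is counted.

```python
from collections import Counter
from itertools import product

# Toy dictionaries (assumed contents); the real ones are pre-collected.
DICTIONARIES = {
    "A": {"smith", "j", "lee", "k"},
    "T": {"learning", "to", "parse", "references"},
    "J": {"journal", "of", "informatics"},
}

def tag_tokens(reference):
    """Split on whitespace and attach the set of matching tags to each token."""
    tagged = []
    for tok in reference.split():
        key = tok.strip(".,;:\"'()").lower()  # normalization is an assumption
        tags = {t for t, words in DICTIONARIES.items() if key in words}
        tagged.append((tok, tags))
    return tagged

def trigram_tag_counts(tagged):
    """Count tag trigrams over consecutive tokens, expanding multi-tag tokens."""
    counts = Counter()
    for i in range(len(tagged) - 2):
        window = [tags for _, tags in tagged[i:i + 3]]
        if all(window):  # skip windows that contain an untagged token
            for combo in product(*(sorted(t) for t in window)):
                counts["".join(combo)] += 1
    return counts

ref = "J. Smith, K. Lee. Learning to parse references. Journal of Informatics"
counts = trigram_tag_counts(tag_tokens(ref))
print(counts.most_common())
```

Trigrams that mix two tags, such as "AAT" or "TTJ", mark the positions where one field gives way to the next, which is why they are useful as frames.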
Author names usually take the form “F M L” or “L, F M”, in which “F”, “M”, and
“L” denote the first, middle, and last name, respectively. Most author names in
references are written in a consistent style and abbreviation convention. Hence,
abbreviation patterns can be used to determine the end of the author field.
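To make the two author styles concrete, the sketch below classifies a single name string as “F M L” or “L, F M” with regular expressions. The expressions and the name author_style are hypothetical; they assume ASCII names, an optional middle name, and initials written with or without a period, and they only illustrate the style check rather than the full end-of-field detection.

```python
import re

# A first or middle name is an initial ("J" or "J.") or a capitalized word.
PART = r"(?:[A-Z]\.?|[A-Z][a-z]+)"
LAST = r"[A-Z][a-z]+"

FML = re.compile(rf"{PART}(?:\s+{PART})?\s+{LAST}")    # first [middle] last
LFM = re.compile(rf"{LAST},\s*{PART}(?:\s+{PART})?")   # last, first [middle]

def author_style(name):
    """Return "L, F M", "F M L", or None if the string fits neither style."""
    name = name.strip()
    if LFM.fullmatch(name):
        return "L, F M"
    if FML.fullmatch(name):
        return "F M L"
    return None

for candidate in ["John A. Smith", "Smith, J. A.", "J. Smith", "1998"]:
    print(candidate, "->", author_style(candidate))
```

Under the consistency assumption in the text, once the style of the first author is known, the first segment that no longer fits that style is a candidate for the end of the author field.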
For the title, the length in a normal reference string is often more than three
words, and few punctuation marks, such as commas or periods, occur within it.
In contrast, punctuation marks, especially commas, are commonly used to separate
author names. Therefore, we calculate the distance between