the context window size of five to extract word features, including the two preceding
words, the current word, and the two following words.
The advantages of normalizing words when encoding them as features have been
shown in many sequential labeling tasks (Tsai, Sung et al. 2006). Therefore, this work
normalized all words within the context window by converting them to lowercase
and encoding all numeric values as the value 1.
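
The word features described above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the function names, the `<PAD>` symbol for positions outside the sentence, and the choice to map each purely numeric token to the string "1" are all assumptions made for the example.

```python
import re

PAD = "<PAD>"  # hypothetical padding symbol for out-of-range positions

def normalize(token):
    """Lowercase a token; map a purely numeric token to "1" (assumed reading)."""
    if re.fullmatch(r"\d+(\.\d+)?", token):
        return "1"
    return token.lower()

def window_features(tokens, i, size=2):
    """Collect normalized words at offsets -size..+size around position i."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        # Positions before the start or after the end get the padding symbol.
        word = normalize(tokens[j]) if 0 <= j < len(tokens) else PAD
        feats[f"w[{offset}]"] = word
    return feats

example = window_features(["Past", "Medical", "History", "2010"], 2)
```

With `size=2` the window covers five positions, matching the two preceding words, the current word, and the two following words.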
Dictionary Features
The strings of the “str” column in the SecTag section header terminology were collected
to compile a dictionary for our dictionary features. For each term in the collected
section heading strings, this work determined the positions at which it occurs within
the dictionary and encoded that position information using 3 bits. Table 1 shows the
encoded results. Based on the dictionary and position information, this work developed
two dictionary features. One is the dictionary matching feature, whose value is 1 if the
current word is a substring of a term in our dictionary. The other is the dictionary
position feature, whose value is the encoded position information if the current word
matches an entry in our dictionary.
Table 1. Position Information

Encoded Position Information    Description
001                             The term only appeared in the first token among all section headings.
010                             The term only appeared in the middle token among all section headings.
100                             The term only appeared in the last token among all section headings.
011                             The term appeared in the first/middle token among all section headings.
110                             The term appeared in the middle/last token among all section headings.
101                             The term appeared in the first/last token among all section headings.
111                             The term appeared in the first/middle/last token among all section headings.
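
One way to derive the 3-bit codes of Table 1 and the two dictionary features is sketched below. It is an illustration under stated assumptions rather than the authors' code: the three example headings are made up, a single-token heading is assumed to count as both first and last, and "000" is a hypothetical default for words that have no position code.

```python
FIRST, MIDDLE, LAST = 0b001, 0b010, 0b100  # bit layout taken from Table 1

def build_position_codes(headings):
    """Map each heading token to the OR of the positions it occupies."""
    codes = {}
    for heading in headings:
        tokens = heading.lower().split()
        for idx, tok in enumerate(tokens):
            bits = 0
            if idx == 0:
                bits |= FIRST
            if idx == len(tokens) - 1:
                bits |= LAST  # assumption: a one-token heading is first and last
            if 0 < idx < len(tokens) - 1:
                bits |= MIDDLE
            codes[tok] = codes.get(tok, 0) | bits
    return codes

def dictionary_features(word, headings, codes):
    """Return (matching feature, position feature) for the current word."""
    word = word.lower()
    # Matching feature: 1 if the word is a substring of any dictionary term.
    match = int(any(word in h.lower() for h in headings))
    # Position feature: 3-bit string, "000" (hypothetical) when no code exists.
    position = format(codes.get(word, 0), "03b") if match else "000"
    return match, position

# Made-up headings for illustration only.
headings = ["past medical history", "family history", "history of present illness"]
codes = build_position_codes(headings)
```

In this toy dictionary, "history" occurs as a last token in two headings and as a first token in the third, so its code combines the first and last bits.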
In addition, in the SecTag terminology, section headings were defined within a
hierarchy and associated with level information indicating their location within the tree.
Each heading string was also normalized to a unique string, which enables us to find
the same section heading expressed under different names. The normalized section
strings and the associated level information were encoded as features.
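
A lookup of this kind might be sketched as below. The table entries, normalized names, level values, and the `NONE`/0 fallback are hypothetical placeholders chosen for the example; they are not actual SecTag rows.

```python
# Hypothetical toy entries: heading variant -> (normalized string, tree level).
HEADING_TABLE = {
    "pmh": ("past_medical_history", 2),
    "past medical history": ("past_medical_history", 2),
    "family history": ("family_history", 2),
}

def heading_features(heading):
    """Return the normalized-string and level features for a heading."""
    key = " ".join(heading.lower().split())  # lowercase, collapse whitespace
    normalized, level = HEADING_TABLE.get(key, ("NONE", 0))
    return {"norm_heading": normalized, "level": level}
```

Because both variants map to the same normalized string, `heading_features("PMH")` and `heading_features("Past Medical History")` yield identical features.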
Affix Features
An affix refers to a morpheme that is attached to a base morpheme to form a word.
This work employed two types of affixes: prefixes and suffixes. Some prefixes and