An instance of this approach is described in [4]. The authors define a similarity
measure between two pages that takes document structure (text in headers and
footers, especially page numbers), layout information (font structure), and content
(text on the pages) into account. They use single-linkage agglomerative clustering
to group pages together, with the clustering process bounded by manually set
thresholds. They report a maximum separation accuracy of 95.68%, using a metric
from [5] that measures the correctness of the number of separation points between
non-adjacent pages. Since our data is different and we solve a combined problem of
classification and separation ([4] performs separation only), their results cannot be
compared directly to ours.
8.3 Data Preparation
The input to our separation solution is the text delivered by an OCR engine run on
scanned page images. We primarily report on data from the mortgage processing
industry, hence the document types (Appraisal, Truth in Lending, etc.). Our
sample here contains documents from 30 document types. The quality of the images
varies based on their origin (original or photocopy) and treatment (fax). Figures 8.1
and 8.2 show two sample images (one from a Loan Application, one from a Note)
and some of the OCR text generated from them.
In order to be prepared for the core classification algorithms (see below), the
input text is tokenized and stemmed. Tokenization uses a simple regular expression
model that also eliminates all special characters. Stemming for English is based on
the Porter algorithm [6].4
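A minimal Python sketch of this preprocessing step follows. The tokenizer regex and the suffix rules are illustrative stand-ins only: the rules below mimic just a fragment of the Porter stemmer's first phase, not the full algorithm.

```python
import re

def tokenize(text):
    """Simple regular-expression tokenizer: lower-cases the text, splits on
    anything that is not a letter or digit, and thereby eliminates all
    special characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Toy suffix stripper standing in for the Porter stemmer [6].
    Only a few rules resembling Porter's first phase are sketched here;
    the real algorithm applies several ordered rule phases."""
    if token.endswith("sses") or token.endswith("ies"):
        return token[:-2]
    if token.endswith("ss"):
        return token
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s"):
        return token[:-1]
    return token

# "address," "addresses," and "addressing" all collapse to the same stem
stems = [stem(t) for t in tokenize("Address, addresses, addressing!")]
```

Running the last line yields the same stem three times, which is exactly the morphological abstraction discussed below.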
The stream of stemmed tokens isolated from a scanned image is then converted
into a feature vector. We are using a bag of words model of text representation; each
token type is represented by a single feature and the value of that feature is the
number of occurrences of the token on the page. In addition, the text is filtered using
a stopword list. This filtering removes words that are very common in a language;
for instance, in English the list includes all closed-class words such as “the,” “a,”
“in,” “he,” etc. Table 8.1 shows some of the features and their values extracted from
a Note. The entries in the table indicate the processing that the text underwent.
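The feature-vector construction can be sketched as follows. The stopword list here is a tiny illustrative subset, and the sample tokens are invented rather than taken from Table 8.1:

```python
from collections import Counter

# tiny illustrative stopword list; a real one covers all closed-class words
STOPWORDS = {"the", "a", "in", "he", "to", "of", "and"}

def page_features(stemmed_tokens):
    """Bag-of-words feature vector for one page: each remaining token type
    is a feature, and its value is the number of occurrences on the page."""
    return Counter(t for t in stemmed_tokens if t not in STOPWORDS)

vec = page_features(["the", "borrow", "promis", "to",
                     "pay", "the", "lender", "pay"])
# "pay" is counted twice; the stopwords "the" and "to" are filtered out
```

A `Counter` is a convenient sparse representation: absent token types implicitly have the value zero.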
Note that these three processes introduce two significant abstractions over the
input text:

- By stemming, we assume that the detailed morphological description of words
  is irrelevant for the purpose of classification. For instance, we cannot
  tell whether the feature "address" in Table 8.1 came from the input "address,"
  "addresses," or "addressing." Inflectional and part-of-speech information is lost.
- By using bags of words, we abstract away from the linear structure of the input
  text. We posit that there is little value in knowing which word appeared before
  or near another; the only important information is which words appear more
  frequently than others.
- The application of a stopword list, finally, de-emphasizes syntactic
  information even further, since many syntactically disambiguating words are
  ignored.
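To make the second abstraction concrete: two texts that differ only in word order yield identical bag-of-words vectors. A small Python illustration, with invented phrases:

```python
from collections import Counter

a = Counter("lender pays borrower".split())
b = Counter("borrower pays lender".split())

# the linear structure is gone: both orderings map to the same feature vector
order_lost = (a == b)
```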
4 We only apply stemming for English text. Text in other languages is used without
morphological processing.