An instance of this approach is described in [4]. The authors define a similarity
measure between two pages that takes document structure (text in headers and
footers, especially page numbers), layout information (font structure), and content
(text on the pages) into account. They use single-linkage agglomerative clustering
to group pages together, with the clustering process bounded by manually set
thresholds. They report a maximum separation accuracy of 95.68%, using a metric
from [5] that measures the correctness of the number of separation points between
non-adjacent pages. Since our data is different and we solve a combined problem of
classification and separation ([4] performs separation only), their results cannot be
compared directly to ours.
8.3 Data Preparation
The input to our separation solution is the text delivered by an OCR engine run on
scanned page images. We primarily report on data from the mortgage processing
industry, hence the document types (Appraisal, Truth in Lending, etc.). Our
sample here contains documents from 30 document types. The quality of the images
varies based on their origin (original or photocopy) and treatment (fax). Figures 8.1
and 8.2 show two sample images (one from a Loan Application, one from a Note)
and some of the OCR text generated from them.
In order to be prepared for the core classification algorithms (see below), the
input text is tokenized and stemmed. Tokenization uses a simple regular expression
model that also eliminates all special characters. Stemming for English is based on
the Porter algorithm [6].4
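A minimal Python sketch of this preprocessing step follows. The tokenizer regex and the suffix rules are illustrative stand-ins only: the rules below mimic just a fragment of the Porter stemmer's first phase, not the full algorithm.

```python
import re

def tokenize(text):
    """Simple regular-expression tokenizer: lower-cases the text, splits on
    anything that is not a letter or digit, and thereby eliminates all
    special characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Toy suffix stripper standing in for the Porter stemmer [6].
    Only a few rules resembling Porter's first phase are sketched here;
    the real algorithm applies several ordered rule phases."""
    if token.endswith("sses") or token.endswith("ies"):
        return token[:-2]
    if token.endswith("ss"):
        return token
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s"):
        return token[:-1]
    return token

# "address," "addresses," and "addressing" all collapse to the same stem
stems = [stem(t) for t in tokenize("Address, addresses, addressing!")]
```

Running the last line yields the same stem three times, which is exactly the morphological abstraction discussed below.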
The stream of stemmed tokens isolated from a scanned image is then converted
into a feature vector. We are using a bag of words model of text representation; each
token type is represented by a single feature and the value of that feature is the
number of occurrences of the token on the page. In addition, the text is filtered using
a stopword list. This filtering removes words that are very common in a language;
for instance, in English the list includes all closed-class words such as “the,” “a,”
“in,” “he,” etc. Table 8.1 shows some of the features and their values extracted from
a Note. The entries in the table indicate the processing that the text underwent.
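The feature-vector construction can be sketched as follows. The stopword list here is a tiny illustrative subset, and the sample tokens are invented rather than taken from Table 8.1:

```python
from collections import Counter

# tiny illustrative stopword list; a real one covers all closed-class words
STOPWORDS = {"the", "a", "in", "he", "to", "of", "and"}

def page_features(stemmed_tokens):
    """Bag-of-words feature vector for one page: each remaining token type
    is a feature, and its value is the number of occurrences on the page."""
    return Counter(t for t in stemmed_tokens if t not in STOPWORDS)

vec = page_features(["the", "borrow", "promis", "to",
                     "pay", "the", "lender", "pay"])
# "pay" is counted twice; the stopwords "the" and "to" are filtered out
```

A `Counter` is a convenient sparse representation: absent token types implicitly have the value zero.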
Note that these three processes introduce two significant abstractions over the
input text:

- By stemming, we assume that the detailed morphological description of words
  is irrelevant for the purpose of classification. For instance, we cannot
  tell whether the feature "address" in Table 8.1 came from the input "address,"
  "addresses," or "addressing." Inflectional and part-of-speech information is lost.
- By using bags of words, we abstract away from the linear structure of the input
  text. We posit that there is little value in knowing which word appeared before
  or near another; the only important information is which words appear more
  frequently than others.
- The application of a stopword list, finally, de-emphasizes syntactic
  information even further, since many syntactically disambiguating words are
  ignored.
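To make the second abstraction concrete: two texts that differ only in word order yield identical bag-of-words vectors. A small Python illustration, with invented phrases:

```python
from collections import Counter

a = Counter("lender pays borrower".split())
b = Counter("borrower pays lender".split())

# the linear structure is gone: both orderings map to the same feature vector
order_lost = (a == b)
```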
4 We only apply stemming for English text. Text in other languages is used without
morphological processing.