Table 8.3. Classification Results

                        Optimized                         Not optimized
                 precision  recall  F1-value      precision  recall  F1-value
Micro averages      0.95     0.95     0.95           0.90     0.90     0.90
Macro averages      0.94     0.86     0.87           0.82     0.79     0.78
Support Vector Machines solve a binary classification problem. The SVM score
associated with an instance of the considered events is its signed distance to the
separating hyperplane in units of the SVM margin. In order to solve multiclass
problems, a series of Support Vector Machines has to be trained; in the case of
a one-vs-all training schema, for example, the number of SVMs trained equals the
number of classes. The scores produced by these different machines are not directly
comparable and must be calibrated so that, at least for a given classification
instance, they lie on a common scale. In this application, the scores not only
must be comparable between classes for a given classification instance (page), but
also between different classification instances (pages), i.e., the SVM scores must be
mapped to probabilities. Platt [13] calibrates SVM scores to class membership
probabilities by interpreting the score as proportional to the logarithm of the
ratio of class membership probabilities. He determines the class membership
probability as a function of the SVM score by fitting a sigmoid function to the
empirically observed class membership probabilities as a function of the SVM
score. The fit parameters are the slope of the sigmoid function and a translational
offset. Given the interpretation of the SVM scores discussed above, the latter
parameter is the logarithm of the ratio of the class prior probabilities. The
method used here [14] fixes the translational offset and fits only the slope
parameter. In addition, the Support Vector Machines are trained using cost factors
for the positive as well as the negative class, and the two cost factors are
optimized independently. Empirical studies performed by the authors showed that
cost factor optimization in conjunction with fitting the slope parameter of the
mapping function from SVM scores to probabilities yields better probability
estimates than fitting the slope and the translational offset without cost factor
optimization, fitting the slope and the translational offset with cost factor
optimization, or fitting the slope only.
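As an illustration only, the slope-only calibration can be sketched roughly as
follows. The sigmoid parametrization, the sign convention for the fixed offset,
and the use of NumPy/SciPy are assumptions made for this sketch, not the
implementation of [14]:

import numpy as np
from scipy.optimize import minimize_scalar

# Sketch of Platt-style calibration with a fixed translational offset, assuming
# the sigmoid form P(y = +1 | s) = 1 / (1 + exp(A*s + B)). Only the slope A is
# fitted; B is fixed from the class priors (the sign convention is an assumption).

def fit_slope(scores, labels):
    """scores, labels: NumPy arrays; labels in {+1, -1}."""
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == -1)
    B = np.log(n_neg / n_pos)        # fixed offset: log ratio of class priors
    t = (labels + 1) / 2             # targets in {0, 1}

    def nll(A):
        # negative log-likelihood of the calibrated probabilities
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -np.sum(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

    result = minimize_scalar(nll)    # one-dimensional fit of the slope only
    return result.x, B

def to_probability(score, A, B):
    # map a raw SVM score to a class membership probability
    return 1.0 / (1.0 + np.exp(A * score + B))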
Table 8.3 summarizes the classification results for different loan forms. The re-
sults shown under the Optimized heading are the classification results obtained
with the class membership probabilities, using cost factor optimization and fitting
the slope of the sigmoid function. Using SVM scores directly, without calibration
or cost factor optimization, yields the results under the heading Not optimized.
The macro averages in particular illustrate the effectiveness of the chosen method.
The observed
improvement is a combined effect of using probabilities instead of SVM scores and
cost factor optimization. An added benefit of optimizing the positive and negative
cost factors is improved handling of OCR noise. As discussed in section 8.3, OCR
increases the feature space considerably, and cost factor optimization becomes
important to avoid overfitting to the training corpus.
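The independent optimization of the positive and negative cost factors can likewise
be illustrated with a minimal sketch. The grid values, the validation F1 criterion,
and the use of scikit-learn's per-class class_weight mechanism are assumptions for
this sketch, not the authors' actual setup:

from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Sketch: independent grid search over positive- and negative-class cost
# factors, realized here via per-class C weighting (class_weight). The grid
# and the selection criterion (validation F1) are illustrative assumptions.
# Labels are assumed to be 0/1, with 1 the positive class.

def optimize_cost_factors(X_train, y_train, X_val, y_val,
                          grid=(0.25, 0.5, 1.0, 2.0, 4.0)):
    best_weights, best_f1 = None, -1.0
    for c_pos in grid:
        for c_neg in grid:
            clf = LinearSVC(C=1.0, class_weight={1: c_pos, 0: c_neg})
            clf.fit(X_train, y_train)
            score = f1_score(y_val, clf.predict(X_val))
            if score > best_f1:
                best_weights, best_f1 = (c_pos, c_neg), score
    return best_weights, best_f1    # best (positive, negative) cost factors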
In summary, the effects of cost factor optimization can be interpreted as follows:
The ratio of positive to negative cost factors determines the right class prior prob-