Using Layout Data for the Analysis of Scientific Literature - Mining Complex Data

Table 1.2. Results from the classification

Data set       BayesNet   DecisionTrees   JRip   SVM
biological     92.9       93.0            92.8   89.0
neurological   91.4       93.6            94.5   84.9

Also, the neurological data set is much more imbalanced in the number of
instances in each category.

During the experiment we measured standard accuracy, the ratio of correctly
classified entities to all entities, under 10-fold cross-validation. Precision and
recall were evenly distributed. The final confusion matrix for the biological data
set showed some confusion between raw data and models, while precision and
recall for full pages were much higher. For the neurological data set, this trend
intensified. Logos could be identified almost perfectly, while raw data were
classified much less accurately (the F-score for raw data was only 0.73). Full
pages were also problematic (F-score 0.7), but that seems natural considering
that there were only 13 instances to learn from.
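
The measures named above can be derived from a confusion matrix. The following is a minimal sketch, assuming that "F-score" means the balanced F1 measure and that rows hold true classes and columns hold predicted classes. The function names and the counts are hypothetical and chosen only for illustration; they are not the results of the experiment.

```python
from typing import Dict, List


def accuracy(confusion: List[List[int]]) -> float:
    """Ratio of correctly classified entities (the diagonal) to all entities."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total if total else 0.0


def per_class_scores(confusion: List[List[int]],
                     labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1 per class.

    confusion[i][j] counts instances of true class i predicted as class j.
    """
    scores = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                    # missed instances of class k
        fp = sum(row[k] for row in confusion) - tp     # wrongly assigned to class k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Made-up counts for illustration only (not taken from the experiment).
labels = ["raw data", "model", "full page"]
confusion = [
    [8, 2, 0],   # true raw data: two instances mistaken for models
    [1, 9, 0],   # true model: one instance mistaken for raw data
    [0, 0, 10],  # true full page: all correct
]

print(accuracy(confusion))
for name, s in per_class_scores(confusion, labels).items():
    print(name, s)
```

In this toy matrix the only errors are between raw data and models, which lowers the scores of those two classes while full pages stay perfect. A class with very few instances makes its per-class scores sensitive to a single misclassification.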

Unlike the method proposed by [26], we decided to test several machine learning
algorithms for comparison. We used Bayes net, decision trees, JRip and support
vector machines (SVMs) with standard parameters from the Weka tool [27]. While
the rule learner JRip and decision trees generally performed similarly, Bayes net
worked significantly better on the biological data set than on the neurological
data set. The SVM's performance was far below that of the other classifiers. As
an SVM does not natively support multiple categories, we had to use binary
classifiers between all possible pairs of categories. This led to overlapping
classifications and the complete exclusion of sparsely populated categories.
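
The pairwise construction can be illustrated with a small voting sketch. It assumes a plain one-vs-one scheme in which every pair of categories has its own binary decider and the category with the most votes wins; the text does not say how the binary outputs were actually combined, so this is an assumption. The function `one_vs_one_predict` and the threshold deciders are hypothetical stand-ins for trained binary SVMs.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Decider = Callable[[float], str]


def one_vs_one_predict(x: float,
                       classes: List[str],
                       pairwise: Dict[Tuple[str, str], Decider]) -> List[str]:
    """Return every class that receives the maximum number of pairwise votes.

    More than one returned class means the binary deciders disagree,
    i.e. the classification is ambiguous (overlapping).
    """
    votes = Counter({c: 0 for c in classes})
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1   # each binary decider casts one vote
    best = max(votes.values())
    return [c for c in classes if votes[c] == best]


# Hypothetical 1-D deciders standing in for trained binary SVMs.
classes = ["logo", "raw data", "full page"]
pairwise: Dict[Tuple[str, str], Decider] = {
    ("logo", "raw data"): lambda x: "logo" if x < 1.0 else "raw data",
    ("logo", "full page"): lambda x: "full page" if x < 2.0 else "logo",
    ("raw data", "full page"): lambda x: "raw data" if x < 3.0 else "full page",
}

print(one_vs_one_predict(1.5, classes, pairwise))  # a single winner
print(one_vs_one_predict(0.5, classes, pairwise))  # a three-way tie
```

With k categories this scheme needs k(k-1)/2 binary deciders. A tie such as the second call shows how pairwise decisions can overlap. A category with very few training instances can also lose every pairwise comparison and so never be predicted.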

1.6 Table Detection

Besides finding images and their captions, a second use for layout information
springs to mind: the analysis of tables. To identify the tables in a paper, we
decided to adopt an algorithm from OCR, as already discussed in the Background
section. The T-Recs algorithm was designed to find tables in scanned pages, but
can be adjusted to work in a vector-based environment as well.

1.6.1 The T-Recs Algorithm

The method presented by Kieninger [28] can be split into three steps. First,
possible table relationships are identified by searching for regular structures in
the layout of the text. Next, some error-correcting methods are employed, and
finally the actual table structure is identified and table content is separated
from non-table content.

In the first step, text units are identified by merging words that overlap
horizontally. An overlap is defined as:
