Table 1.2. Results from the classification (accuracy in %)

Data set        BayesNet   DecisionTrees   JRip   SVM
biological      92.9       93.0            92.8   89.0
neurological    91.4       93.6            94.5   84.9

Also, the neurological data set is much more imbalanced in the number of
instances per category.
During the experiment we measured standard accuracy, i.e. the ratio of correctly
classified entities to all entities, under 10-fold cross-validation. Precision and
recall were evenly distributed. The final confusion matrix for the biological data
set showed some confusion between raw data and models, while precision and
recall for full pages were much higher. For the neurological data set, this trend
intensified. Logos could be identified almost perfectly, while raw data were
classified much less accurately (the F-score for raw data was only 0.73). Full
pages were also problematic (F-score 0.7), but that seems natural considering
that there were only 13 instances to learn from.
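
The measures named above can all be read off a confusion matrix. The following is a minimal Python sketch of that computation. It is not the evaluation code used in the experiment (which relied on Weka); the function name `per_class_scores` and the toy counts are hypothetical and chosen only for illustration, not taken from the reported results.

```python
def per_class_scores(confusion, labels):
    """Return overall accuracy and per-class (precision, recall, F-score).

    confusion[i][j] is the number of instances whose true class is
    labels[i] and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    accuracy = correct / total

    scores = {}
    for k, label in enumerate(labels):
        true_pos = confusion[k][k]
        predicted = sum(row[k] for row in confusion)  # column sum
        actual = sum(confusion[k])                    # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f_score)
    return accuracy, scores


# Toy counts, invented for illustration only (not the experiment's data).
labels = ["logo", "raw data", "full page"]
toy_confusion = [
    [8, 0, 0],  # true logos
    [1, 6, 1],  # true raw data
    [0, 2, 2],  # true full pages
]
accuracy, scores = per_class_scores(toy_confusion, labels)
for label in labels:
    p, r, f = scores[label]
    print(f"{label}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

In the toy matrix the smallest class has the lowest F-score although overall accuracy stays high, which is the kind of effect a small category can produce.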
Unlike the method proposed by [26], we decided to test several machine learning
algorithms for comparison. We used Bayes net, decision trees, JRip and support
vector machines (SVMs) with standard parameters from the Weka Tool [27]. While
the rule learner JRip and decision trees generally performed similarly, Bayes net
worked significantly better on the biological data set than on the neurological
data set. The SVM's performance was far below average. As an SVM does not
natively support multiple categories, we had to use binary classifiers between all
possible combinations of categories. This led to overlapping classifications and
the complete exclusion of sparsely populated categories.
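
To illustrate how pairwise binary classifiers can produce overlapping classifications, here is a minimal Python sketch of one-vs-one voting. Combining the pairwise decisions by majority vote is an assumption made for this sketch; the text does not state how the binary results were combined. The names `one_vs_one_predict` and `make_decider` are hypothetical, as are the lookup tables, which stand in for trained binary SVMs.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(classes, pairwise_decide, x):
    """Vote over all class pairs; return every class tied for most votes."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_decide(a, b, x)] += 1
    best = max(votes.values())
    winners = [c for c in classes if votes[c] == best]
    return winners, dict(votes)


def make_decider(outcomes):
    """Hypothetical stand-in for trained binary classifiers: a lookup table."""
    def decide(a, b, x):
        return outcomes[(a, b)]
    return decide


classes = ["logo", "model", "raw data"]

# Consistent pairwise decisions: one clear winner.
consistent = make_decider({
    ("logo", "model"): "logo",
    ("logo", "raw data"): "logo",
    ("model", "raw data"): "model",
})
# Cyclic pairwise decisions: every class gets one vote, so all of them tie.
cyclic = make_decider({
    ("logo", "model"): "logo",
    ("logo", "raw data"): "raw data",
    ("model", "raw data"): "model",
})

clear_winners, clear_votes = one_vs_one_predict(classes, consistent, x=None)
tied_winners, tied_votes = one_vs_one_predict(classes, cyclic, x=None)
print(clear_winners, clear_votes)
print(tied_winners, tied_votes)
```

In the cyclic case no single category wins, so the instance is assigned to several categories at once. A category whose binary classifiers rarely decide in its favor, for example because it has few training instances, can also end up never winning a vote.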
1.6 Table Detection
Besides finding images and their captions, a second use for layout information
springs to mind: the analysis of tables. To identify the tables in a paper, we
decided to adopt an algorithm from OCR, as already discussed in the Background
section. The T-Recs algorithm was designed to find tables in scanned pages, but
it can be adjusted to work in a vector-based environment as well.
1.6.1 The T-Recs Algorithm
The method presented by Kieninger [28] can be split into three steps. First,
possible table relationships are identified by searching for regular structures in
the layout of the text. Next, some error-correcting methods are employed.
Finally, the actual table structure is identified and table content is separated
from non-table content.
In the first step, text units are identified by merging words that overlap
horizontally. An overlap is defined as:
 