Using Layout Data for the Analysis of Scientific Literature - Mining Complex Data

Fig. 1.4. Sample pictures from the neurological data set (panels: raw data,
full page, model, logo). The category is written below the image.

In the neurological test set, we used four categories: full pages, raw data,
models and logos. The biological test set contained no logos, hence it had only
three categories. The distribution was also quite different: instead of a large
number of full pages, we had many logos. We expect the error rate on this set
to be higher than on the biological data set because of the increased number of
categories. Unlike in the biological data set, the distribution is much more
biased, with only 13 instances of full pages and over 1000 models.

1.5.1 Method for Classification

For image classification, a feature-based approach seems best, because we do
not classify based on the object seen in the image, but on the representation
of that object, e.g. gel blots vs. graph points. Other algorithms, like the
random window approach, tend to suppress those representation details. We base
our method on [26], a method originally used to distinguish between
computer-made images and real-life photos, since that is a closely related
problem.

To classify the pictures, we calculate six metrics, or features, for each
picture. The calculations for the metrics are all linear, so they take less
than a second for an average picture. The small number of attributes allows
fast learning and classification. An information gain estimate is given in
Table 1.1. The features are explained below, together with an interpretation
of how useful these features were to our task.
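As a rough illustration of how a single-pass feature and its information gain
could be computed, the sketch below counts distinct colours (the first feature
in the list that follows) and measures the entropy reduction of a threshold
split on that feature. The image representation (nested lists of RGB tuples),
the function names, the toy images, the labels and the threshold of 3 are all
assumptions made for this example. The text does not state how the estimates
in Table 1.1 were obtained, so this should not be read as the authors'
procedure.

```python
import math
from collections import Counter


def count_colours(image):
    """Number of distinct colour values, found in one pass over all pixels."""
    return len({pixel for row in image for pixel in row})


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on value <= threshold."""
    low = [lab for v, lab in zip(values, labels) if v <= threshold]
    high = [lab for v, lab in zip(values, labels) if v > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder


# Hypothetical toy data: a two-colour "graph" and a four-colour "photo".
BLACK, WHITE = (0, 0, 0), (255, 255, 255)
graph = [[WHITE, BLACK], [BLACK, WHITE]]
photo = [[(10, 20, 30), (11, 21, 31)], [(12, 22, 32), (13, 23, 33)]]

images = [graph, graph, photo, photo]
labels = ["model", "model", "raw data", "raw data"]

features = [count_colours(img) for img in images]
gain = information_gain(features, labels, threshold=3)
print(features, gain)
```

On this toy set the colour count separates the two classes perfectly, so the
split removes all of the label entropy. On real data the gain would be lower
and would depend on how the feature values are discretised.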

1. Number of Colours: counts the number of colours occurring in the picture.
We assume that many colours indicate the slow colour changes typical of photos
of experimental results, while graphs are usually black and white.

2. Contour Sharpness: measures the occurrence of hard changes in the colour
values. First, it compares each pixel with its neighbouring pixels to find the
biggest colour difference between them. Then, all pixels with a maximum
difference bigger than 0 are counted as S and those bigger than a threshold t
