Image Processing Reference
The CVSEG binarization tests used four types of binarization algorithms. The experiments
showed that the best options are NLBIN (a nonlinear binarization algorithm recommended
by the authors of the Ocropus library [18]) and Sauvola's algorithm [19]. Both algorithms
produce similar results and are usually affected by the following problems:
• the binarized image includes large black areas, especially in scans of old documents;
• rotated images, usually produced by the human operator or by the anchoring devices,
introduce skew angle estimation errors;
• threshold-related problems (usually caused by low contrast or by the quality of the
paper), which cause some details to be eroded and others to be enhanced.
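For reference, Sauvola's method [19] computes a separate threshold for each pixel from the mean m and standard deviation s of its neighbourhood: T = m (1 + k (s / R - 1)). The following is a minimal sketch in plain Python, not the CVSEG or Ocropus implementation. The window size, k = 0.2 and R = 128 are assumed values chosen for illustration, and the function names are hypothetical.

```python
import math


def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def window_sum(sat, y0, x0, y1, x1):
    """Sum of the pixels in rows y0..y1-1 and columns x0..x1-1."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]


def sauvola_binarize(gray, window=15, k=0.2, R=128.0):
    """Return a 0/1 image where 1 marks ink (pixel darker than its local threshold).

    window, k and R are illustrative assumptions, not values from the text.
    """
    h, w = len(gray), len(gray[0])
    half = window // 2
    sat = integral_image(gray)
    sat_sq = integral_image([[v * v for v in row] for row in gray])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        # Clip the window at the image border.
        y0, y1 = max(0, y - half), min(h, y + half + 1)
        for x in range(w):
            x0, x1 = max(0, x - half), min(w, x + half + 1)
            n = (y1 - y0) * (x1 - x0)
            mean = window_sum(sat, y0, x0, y1, x1) / n
            var = window_sum(sat_sq, y0, x0, y1, x1) / n - mean * mean
            std = math.sqrt(max(var, 0.0))  # guard against rounding below zero
            threshold = mean * (1.0 + k * (std / R - 1.0))
            out[y][x] = 1 if gray[y][x] < threshold else 0
    return out


# Synthetic 9x9 page: light paper (200) with one dark vertical stroke (30).
page = [[30 if x == 4 else 200 for x in range(9)] for _ in range(9)]
binary = sauvola_binarize(page, window=5)
for row in binary:
    print("".join("#" if v else "." for v in row))
```

Because the threshold follows the local mean, a uniformly light region stays white even when it is darker than the paper elsewhere on the page. The same property explains the failure modes listed above: when contrast is low, s is small and the threshold sits close to a fixed fraction of the mean, so faint strokes can be lost.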
The classification algorithm was then tested only against documents that exhibited the
problems described above (the segmentation process was 100% accurate). The overall
classification results are presented in Table 3.
Table 3
Binarization Experiments
Binarization Problem      Overall Results (%)
Large black areas         75.5
Skew angle estimation     90.8
Threshold problems        89.1
The largest drop is caused by the large black areas that the binarization stage introduces
in old documents. These areas usually cover the images as well as the text, which causes
all subsequent processing stages (segmentation and classification) to fail. Figure 6 shows
this problem: the drawing in the upper area has been completely covered and the only
visible area is the one containing text.
FIGURE 6 Binarization problems.
The next performance drop is caused by the threshold problems, which may erode some
details until they completely vanish. Specifically, in the over-dilated images the small
details are usually merged into black tiles, and in the over-eroded images the details
composed of thin lines disappear completely. Under these circumstances, the classification
algorithm cannot extract the same number of descriptors, so the accuracy dropped by 3%.
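The effect of a badly chosen threshold can be reproduced on a toy image. The sketch below uses a single global threshold rather than NLBIN or Sauvola's algorithm, purely to keep the example short. The gray levels and the three threshold values are invented for illustration.

```python
def binarize(gray, threshold):
    """Global threshold: 1 marks ink (pixel darker than threshold), 0 marks paper."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def show(binary):
    """Render a 0/1 image as text, '#' for ink and '.' for paper."""
    return "\n".join("".join("#" if v else "." for v in row) for row in binary)


# Hypothetical gray levels: light paper, a faint thin line, a smudged gap, dark ink.
PAPER, FAINT, GAP, INK = 210, 140, 170, 40

sample = [
    [PAPER] * 7,
    [PAPER, FAINT, FAINT, FAINT, FAINT, FAINT, PAPER],  # thin, low-contrast line
    [PAPER] * 7,
    [PAPER, INK, INK, GAP, INK, INK, PAPER],            # two dots with a smudge between
    [PAPER] * 7,
]

over_eroded = binarize(sample, 100)   # threshold too low: the faint line vanishes
balanced = binarize(sample, 155)      # line kept, dots still separate
over_dilated = binarize(sample, 190)  # threshold too high: the dots merge into one tile

for name, image in [("over-eroded", over_eroded),
                    ("balanced", balanced),
                    ("over-dilated", over_dilated)]:
    print(name)
    print(show(image))
```

In the over-eroded result the thin line contributes no ink pixels at all, and in the over-dilated result the two dots become one connected run. Either change alters the shapes from which descriptors would be extracted.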
 
 