From the examples in Figures 8.1 and 8.2, it can be seen that the OCR process introduces a significant degree of noise into the textual data on which all further processing operates. We have not undertaken experiments specifically designed to evaluate the degradation in classification or separation accuracy that this OCR noise induces. Such experiments could be set up to work from cleaned-up OCR text. Since cleaning implies a large amount of manual labor, the change in document quality could instead be simulated by printing electronic documents and degrading the pages, e.g., by copying them repeatedly.
Table 8.2. Some of the features related to the stem "borrow"

Token        #Occur   Token        #Occur   Token        #Occur
borrnu            1   borronv           4   borrovv           8
borrnwer          1   borrotr           1   borrovvcr         1
borro            92   borrou            3   borrovvef         1
borroa            1   borrov           14   borrovvei         1
borroaer          1   borrovc           1   borrovvfir        1
borroh            4   borrovcf          1   borrovvi          1
borroi            1   borrovd           1   borrovw           1
borroifril        1   borrovi           3
borrojv           1   borrovj           3
borrokbr          1   borrovjar         1
borrom            2   borrovl           1
borromad          1   borrovrti         1
borromicrl        1   borrovt           1
borromr           1   borrovti          1
borron            1   borrovu           1
OCR noise also affects the size of training sets negatively. Under the bag-of-words model, the text for each page is converted into a feature vector whose dimensionality equals the number of distinct words (or stems) in the training corpus. Noise introduced during OCR multiplies this number by generating many seemingly distinct, spurious words. Table 8.2 shows a small sample of the features related to the stem "borrow."
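The vocabulary-inflation effect can be illustrated with a minimal sketch of bag-of-words vocabulary construction. The page texts and the OCR variants ("borrovv," "borrou") are hypothetical examples in the spirit of Table 8.2, not data from the corpus described here:

```python
from collections import Counter

def build_vocabulary(pages):
    """Build the bag-of-words vocabulary: one feature per distinct token."""
    vocab = Counter()
    for page in pages:
        vocab.update(page.split())
    return vocab

# Clean text: every occurrence of "borrow" maps to the same feature.
clean = ["the borrow request", "borrow terms apply"]
# OCR output: each misrecognized variant becomes a spurious new feature.
noisy = ["the borrovv request", "borrou terms apply"]

clean_vocab = build_vocabulary(clean)   # one feature for "borrow", count 2
noisy_vocab = build_vocabulary(noisy)   # two distinct features instead
```

With real OCR output, dozens of such variants can appear for a single stem, so the feature-vector dimensionality grows far beyond the true vocabulary size.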
For some data sets, the number of OCR-induced variations becomes so high that the size of the training set exceeds reasonable memory limits (e.g., > 2 GB). In those cases, we apply a preliminary feature selection step that removes features with low occurrence counts until the feature set is small enough. In general, though, we prefer to keep all features available to the classification mechanism and not to perform any initial feature selection. Only when size or performance constraints require it do we apply feature selection to reduce the feature set, using basic frequency filtering together with information gain or mutual information as selection criteria.
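A minimal sketch of the two selection mechanisms mentioned above, frequency filtering and information gain, might look as follows. The occurrence counts mirror a few entries from Table 8.2; the documents, labels, and threshold are hypothetical illustrations, not the actual pipeline:

```python
import math
from collections import Counter

def frequency_filter(term_counts, min_count):
    """Drop features whose corpus-wide occurrence count is below min_count."""
    return {t: c for t, c in term_counts.items() if c >= min_count}

def information_gain(docs, labels, term):
    """Information gain of a binary term-presence feature w.r.t. class labels."""
    def entropy(lbls):
        n = len(lbls)
        if n == 0:
            return 0.0
        return -sum((c / n) * math.log2(c / n) for c in Counter(lbls).values())

    present = [l for d, l in zip(docs, labels) if term in d]
    absent = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    cond = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - cond

# Counts taken from Table 8.2: most OCR variants occur only once or twice,
# so even a low threshold removes the bulk of the spurious features.
counts = {"borro": 92, "borrov": 14, "borrovv": 8, "borroa": 1, "borrovd": 1}
kept = frequency_filter(counts, min_count=4)

# Hypothetical labeled pages (as token sets) to score one surviving feature.
docs = [{"borro", "loan"}, {"borro", "rate"}, {"meeting"}, {"agenda"}]
labels = ["finance", "finance", "admin", "admin"]
gain = information_gain(docs, labels, "borro")  # perfectly predictive here
```

Frequency filtering is cheap and already removes most singleton OCR variants; the information-gain score then ranks the remaining features by how well their presence predicts the class.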
 