Textual Images (Digital Library) Part 1

Plain text documents in digital libraries are often produced by digitizing paper documents. Digitization is the process of taking traditional library materials, typically in the form of books and papers, and converting them to electronic form, which can be stored and manipulated by a computer. Digitizing a large collection is a time-consuming and expensive process that should not be undertaken lightly.

Digitizing proceeds in two stages, illustrated in Figure 4.2. The first stage produces a digitized image of each page using a process known as scanning. The second stage produces a digital representation of the textual content of the pages using optical character recognition (OCR). In many digital library systems, what is presented to library readers is the result of the scanning stage: page images, electronically delivered. The OCR stage is necessary if a full-text index is to be built that will allow searchers to locate any combination of words, or if any automatic metadata extraction technique is contemplated, such as identifying document titles by seeking them in the text. Sometimes the second stage is omitted, but full-text search is then impossible, which negates a prime advantage of digital libraries.

If, as is usually the case, OCR is undertaken, the result can be used as an alternative way of displaying the page contents. The display will be more attractive if the OCR system is able not only to interpret the text in the page image but also to retain the page layout.



Figure 4.2: Scanning and optical character recognition

Whether it is a good idea to display OCR output depends on how well the page content and format are captured by the OCR process, among other things.

Scanning

The result of the first stage, scanning, is a digitized image of each page. The image resembles a digital photograph, although its picture elements or pixels may be either black or white—whereas photos have pixels that come in color, or at least in different shades of gray. Text is well represented in black and white, but if the image includes nontextual material, such as pictures, or exhibits artifacts like coffee stains or creases, grayscale or color images will resemble the original pages more closely. Image digitization is discussed more fully in the next topic.

When scanning page images you need to decide whether to use black-and-white, grayscale, or color, and you also need to determine the resolution of the digitized images—that is, the number of pixels per linear unit. A familiar example of black-and-white image resolution is the ubiquitous laser printer, which generally prints 600-1200 dots per inch. Table 4.2 shows the resolution of several common imaging devices.

The number of bits used to represent each pixel also helps to determine image quality. Most printing devices are black and white: one bit is allocated to each pixel. When putting ink on paper, this representation is natural—a pixel is either inked or not. However, display technology is more flexible, and computer screens allow several bits per pixel. Color displays range up to 24 bits per pixel, encoded as 8 bits for each of the colors red, green, and blue, or even 32 bits per pixel, encoded in a way that separates the chromatic, or color, information from the achromatic, or brightness, information. Color scanners can be used to capture images having more than 1 bit per pixel.
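The 24-bit encoding mentioned above can be made concrete with a minimal Python sketch, packing 8 bits for each of red, green, and blue into one integer (the function names here are ours, purely for illustration):

```python
# Pack and unpack a 24-bit RGB pixel: 8 bits per channel.
def pack_rgb(r, g, b):
    """Combine three 8-bit channel values into one 24-bit integer."""
    return (r << 16) | (g << 8) | b

def unpack_rgb(pixel):
    """Split a 24-bit pixel back into its red, green, and blue channels."""
    return (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF

white = pack_rgb(255, 255, 255)            # all channels at maximum
assert white == 0xFFFFFF
assert unpack_rgb(white) == (255, 255, 255)
```

The 32-bit representations mentioned above use a different layout, separating brightness from color information, and are not shown here.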

More bits per pixel can compensate for a lack of linear resolution and vice versa. Research on human perception has shown that if a dot is small enough, its brightness and size are interchangeable—that is, a small bright dot cannot be distinguished from a larger, dimmer one. The critical size below which this phenomenon takes effect depends on the contrast between dots and their background, but corresponds roughly to a very low-resolution (640 x 480) display at normal viewing levels and distances.

Table 4.2: An assortment of devices and their resolutions

Device                                                              Resolution (dpi)           Depth (bits)
Laptop computer screen (17-inch diagonal, 1680 x 1050 resolution)   116 x 116                  24-32
Fax machine                                                         200 x 200                  1
Scanner                                                             600 x 600                  24
Laser printer                                                       600 x 600 to 1200 x 1200   1
Phototypesetter                                                     4800 x 4800                1

When digitizing documents for a digital library, think about what you want the user to be able to see. How closely does it need to resemble the original document pages? Are you concerned about preserving artifacts? What about pictures in the text? Will users see one page on the screen at a time? Will they want to magnify the images?

You will need to obtain scanned versions of several sample pages, chosen to cover the kinds and quality of images in the collection, and digitized to a range of different qualities (e.g., different resolutions, different gray levels, and color versus monochrome). You should conduct trials with end users of the digital library to determine what qualities are necessary for actual use.

It is always tempting to say that quality should be as high as it possibly can be. But there is a cost: the downside of accurate representation is increased storage space on the computer and—probably more importantly—increased time required for page access by users, particularly remote users. Doubling the linear resolution quadruples the number of pixels, and although this increase is ameliorated by compression techniques, users still pay a toll in access time. Your trials should take place on typical computer configurations using typical communications facilities, so that you can assess the effect of download time as well as image quality. You might also consider generating thumbnail images, or images at several different resolutions, or using a "progressive refinement" form of image transmission (see Section 5.3), so that users who need high-quality pictures can be sure they’ve got the right one before embarking on a lengthy download.
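The arithmetic behind this trade-off is easy to sketch. The following Python fragment (a letter-size page is assumed; real scans would also be compressed) estimates uncompressed image size, and confirms that doubling the linear resolution quadruples the pixel count:

```python
# Uncompressed size, in bytes, of a scanned letter-size page
# (8.5 x 11 inches) at a given resolution and bit depth.
def page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    pixels = (dpi * width_in) * (dpi * height_in)
    return pixels * bits_per_pixel / 8

# Doubling the linear resolution quadruples the number of pixels.
assert page_bytes(600, 1) == 4 * page_bytes(300, 1)

# A 300 dpi bilevel page is roughly 1 MB uncompressed; the same page
# in 8-bit grayscale is eight times that.
assert page_bytes(300, 8) == 8 * page_bytes(300, 1)
```

Compression reduces these figures considerably, but the proportions, and hence the relative download times, remain.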

Optical character recognition

The second stage of digitizing library material is to transform the scanned image into a digitized representation of the page content—in other words, a character-by-character representation rather than a pixel-by-pixel one. This is known as optical character recognition (OCR). Although the OCR process itself can be entirely automatic, subsequent manual cleanup is invariably necessary and is usually the most expensive and time-consuming operation involved in creating a digital library from printed material. OCR might be characterized as taking "dumb" page images that are nothing more than images and producing "smart" electronic text that can be searched and processed in many different ways.

As a rule of thumb, a resolution of 300 dpi is needed to support OCR of regular fonts (10-point or greater), and 400 to 600 dpi for smaller fonts (9-point or less). Many OCR programs can tune the brightness of grayscale images appropriately for the text being recognized, so grayscale scanning tends to yield better results than black-and-white scanning. However, black-and-white images generate much smaller files than grayscale ones.

Not surprisingly, the quality of the output of an OCR program depends critically on the quality of the input. With clear, well-printed English, on clean pages, in ordinary fonts, digitized to an adequate resolution, laid out on the page in the normal way, with no tables, images, or other nontextual material, a leading OCR engine is likely to be 99.9 percent accurate or above—say 1 to 4 errors per 2,000 characters, which is a little under a page of this topic. Accuracy continues to increase, albeit slowly, as technology improves. Replicating the exact format of the original image is more difficult, although for simple pages an excellent approximation will be achieved.

Unfortunately, the OCR operation is rarely presented with favorable conditions. Problems occur with proper names, with foreign names and words, and with special terminology—like Latin names for biological species. Problems are incurred with strange fonts, and particularly with alphabets that have accents or diacritical marks, or non-Roman characters. Problems are generated by all kinds of mathematics, by small type or smudgy print, and by overly dark characters that have smeared or bled together or overly light ones whose characters have broken up. OCR has problems with tightly packed or loosely set text where, to justify the margins, character and word spacing diverge widely from the norm. Hand annotation interferes with print, as does water-staining, or extraneous marks like coffee stains or squashed insects. Multiple columns, particularly when set close together, are difficult. Other problems are caused by any kind of pictures or images—particularly ones that contain some text; by tables, footnotes, and other floating material; by unusual page layouts; and by text in the image that is skewed, or lines of text that are bowed from the attempt to place book pages flat on the scanner platen, or by the book’s binding if it interferes with the scanned text. These problems may sound arcane, but almost all OCR projects encounter them.

The highest and most expensive level of accuracy attainable from commercial service bureaus is typically 99.995 percent, or 1 error in 20,000 characters of text (approximately six pages of this topic). Such a level is often most easily achieved by having the text retyped manually rather than by having it processed automatically by OCR. Each page is processed twice, by different operators, and the results are compared automatically. Any discrepancies are resolved manually.
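The double-keying comparison can be sketched as follows. This simplified version compares the two transcripts position by position; a production system would align them with an edit-distance diff so that a single dropped character does not flag the whole remainder (the function name is ours):

```python
# Double-keying check: two operators type the same page independently,
# and every position where their transcripts disagree is flagged for
# manual resolution.
def discrepancies(key1, key2):
    """Return (position, char1, char2) for every mismatch."""
    diffs = [(i, a, b)
             for i, (a, b) in enumerate(zip(key1, key2)) if a != b]
    if len(key1) != len(key2):   # unequal lengths are also a discrepancy
        diffs.append((min(len(key1), len(key2)), None, None))
    return diffs

# A classic confusion: the letter "l" typed as the digit "1".
assert discrepancies("digital library", "digita1 library") == [(6, "l", "1")]
```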

As a rule of thumb, OCR becomes less efficient than manual keying when its accuracy rate drops below 95 percent. Moreover, once the initial OCR pass is complete, cleanup costs tend to double with each halving of the error rate. However, in a large digitization project, errors are usually non-uniformly distributed over pages: often 80 percent of errors come from 20 percent of the page images. It may be worthwhile to have the worst of the pages manually keyed and to perform OCR on the remainder.
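The triage this suggests might be sketched as follows, assuming a per-page accuracy estimate is available (for instance, from the OCR engine's own confidence scores; that source, and the page identifiers, are our assumptions):

```python
# Route each page to OCR cleanup or manual rekeying, using the
# 95 percent rule of thumb as the cutoff.
THRESHOLD = 0.95

def triage(page_accuracies):
    """Split page ids into those worth OCR'ing and those to rekey."""
    ocr, rekey = [], []
    for page, accuracy in page_accuracies.items():
        (ocr if accuracy >= THRESHOLD else rekey).append(page)
    return ocr, rekey

pages = {"p1": 0.999, "p2": 0.93, "p3": 0.97, "p4": 0.88}
ocr, rekey = triage(pages)
assert ocr == ["p1", "p3"] and rekey == ["p2", "p4"]
```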

Human intervention is often valuable for cleaning up both the image before OCR and, afterward, the text produced by OCR. The actual recognition part can be time-consuming—maybe one or two minutes per page—so it is useful to perform interactive preprocessing for a batch of pages, have them recognized offline, and then return to the batch for interactive cleanup. Careful attention to such practical details can make a great deal of difference in a large-scale project.

Interactive OCR involves six steps: image acquisition, cleanup, page analysis, recognition, checking, and saving.

Acquisition, cleanup, and page analysis

Images are acquired either by inputting them from a document scanner or by reading a file that contains predigitized images. In the former case, the document is placed on the scanner platen and the program produces a digitized image. Most digitization software can communicate with a wide variety of image acquisition devices. An OCR program may be able to scan a batch of several pages and let you work interactively on the other steps afterward. This is particularly useful if you have an automatic document feeder.

The cleanup stage applies image-processing operations to the image. For example, a despeckle filter cleans up isolated pixels or "pepper and salt" noise. It may be necessary to rotate the image by 90 or 180 degrees, or to automatically calculate a skew angle and deskew the image by rotating it back by that angle. Images may be converted from white-on-black to the standard black-on-white representation, and double-page spreads may be converted to single-image pages. These operations may be invoked manually or automatically. If you don’t want to recognize certain parts of the image, or if it contains large artifacts—such as photocopied parts of the document’s binding—you may need to remove them manually by selecting the unwanted area and clearing it.
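A despeckle filter of the kind just described can be sketched in a few lines of Python. This illustrative version works on a bilevel bitmap held as nested lists and clears any black pixel with no black neighbours; real filters are more sophisticated, but the principle is the same:

```python
# Minimal despeckle sketch for a bilevel bitmap (1 = black, 0 = white):
# a black pixel with no black pixel among its eight neighbours is
# treated as isolated "pepper" noise and cleared.
def despeckle(img):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]          # copy; leave the input intact
    for y in range(h):
        for x in range(w):
            if img[y][x] == 1:
                neighbours = sum(
                    img[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))
                    if (ny, nx) != (y, x)
                )
                if neighbours == 0:        # isolated speck: remove it
                    out[y][x] = 0
    return out

noisy = [[1, 0, 0],     # isolated speck in the top-left corner
         [0, 0, 0],
         [0, 1, 1]]     # two adjacent pixels: part of a stroke, kept
assert despeckle(noisy) == [[0, 0, 0], [0, 0, 0], [0, 1, 1]]
```

Deskewing, by contrast, is a geometric operation (estimating the dominant text angle and rotating the image back by it) and is not shown here.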

The page analysis stage examines the layout of the page and determines which parts to process and in what order. Again, page analysis can be either manual or automatic. The result divides the page into blocks of different types: typically text blocks, which will be interpreted as ordinary running text; table blocks, which will be further processed to analyze the layout before reading each table cell; and picture blocks, which will be ignored in the character recognition stage. During page analysis, multicolumn text layouts are detected and sorted into correct reading order.
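Sorting blocks into reading order can be illustrated with a simple two-column case. Real page-analysis algorithms are far more elaborate; here the dividing line between the columns is simply given as a parameter, which is an assumption on our part:

```python
# Reading-order sketch for a two-column page: group blocks into columns
# by their left edge, then read top to bottom within each column,
# left column first.
def reading_order(blocks, column_split):
    """blocks: list of (left, top, label); column_split: x dividing the columns."""
    left_col = sorted((b for b in blocks if b[0] < column_split),
                      key=lambda b: b[1])
    right_col = sorted((b for b in blocks if b[0] >= column_split),
                       key=lambda b: b[1])
    return [label for _, _, label in left_col + right_col]

blocks = [(320, 40, "B1"), (20, 40, "A1"), (20, 400, "A2"), (320, 300, "B2")]
assert reading_order(blocks, column_split=300) == ["A1", "A2", "B1", "B2"]
```

Nonrectangular regions and tilted blocks, like those in Figure 4.3, are exactly the cases where such simple geometric rules break down and manual override becomes necessary.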

Figure 4.3a shows an example of a scanned document with regions that contain different types of data: text, two graphics, and a photographic image. In Figure 4.3b, bounding boxes have been drawn (manually in this case) around these regions. This particular layout is interesting because it contains a region—the large text block halfway down the left-hand column—that is clearly nonrectangular, and another region—the halftone photograph—that is tilted. Because layouts like this present significant challenges to automatic page analysis algorithms, many interactive OCR systems show users the result of automatic page analysis and offer the option of manually overriding it.

It is also useful to be able to manually set up a template that applies to a whole batch of pages. For example, you might define header and footer regions, and specify that each page contains a double column of text—perhaps even give the bounding boxes of the columns. Perhaps the page analysis process can be circumvented by specifying in advance that all pages contain single-column running text, without headers, footers, pictures, or tables. Finally, although word spacing is usually ignored, in some cases spaces may be significant—as in formatted computer programs.

Tables are particularly difficult for page analysis. For each table, the user may be able to specify interactively such things as whether the table has one line per entry or contains multiline cells, and whether the number of columns is the same throughout or some rows contain merged cells. As a last resort, it may be necessary for the user to specify every row and column manually.

Recognition

The recognition stage reads the characters on the page. This is the actual "OCR" part. One parameter that may need to be specified is the typeface (e.g., regular typeset text, fixed-width typewriter print, or dot-matrix characters). Another is the alphabet or character set, which is determined by the language used. Most OCR packages deal with only the Roman alphabet, although some accept Cyrillic, Greek, and Czech as well. Recognition of Arabic text, the various Indian scripts, or ideographic languages like Chinese and Korean calls for specialist software.

Summary

Figure 4.3: (a) a scanned document image containing regions of text, two graphics, and a halftone photograph. Copyright © 1992 Canadian Artificial Intelligence Magazine

Figure 4.3, cont’d: (b) the document image segmented into different regions. Copyright © 1992 Canadian Artificial Intelligence Magazine

Documents in German include an additional character, ß or scharfes s, which is unique because, unlike all other German letters, it exists only in lowercase. (A recent change in the official definition of the German language has replaced some, but not all, occurrences of ß by ss.) European languages use accents: the German umlaut (ü); the French acute (é), grave (à), circumflex (ô), and cedilla (ç); the Spanish tilde (ñ). Documents may, of course, be multilingual.

For certain document types it may help to create a new "language" to restrict the characters that can be recognized. For example, a particular set of documents may be all in uppercase, or consist of nothing but numbers and associated punctuation.

In some OCR systems, the recognition engine can be trained to attune it to the peculiarities of the documents being read. Training may be helpful if the text includes decorative fonts or special characters like mathematical symbols. It may also be useful for recognition of large batches of text (100 pages or more) in which the print quality is low.

For example, the letters in some particular character sequences may have bled or smudged together on the page so that they cannot be separated by the OCR system’s segmentation mechanism. In typographical parlance they form a ligature: a combination of two or three characters set as a single glyph—such as fi, fl, and ffl in the font in which this topic is printed. Although OCR systems recognize standard ligatures as a matter of course, printing occasionally contains unusual ligatures, as when particular sequences of two or three characters are systematically joined together. In these cases it may be helpful to train the system to recognize each combination as a single unit.

Training is accomplished by making the system process a page or two of text in a special training mode. When unrecognized characters are encountered, the user can enter them as new patterns. It may first be necessary to adjust the bounding box to include the whole pattern and exclude extraneous fragments of other characters. Recognition accuracy will improve if several examples of each pattern are supplied. When naming the pattern, its font properties (italic, bold, small capitals, subscript, superscript) may need to be specified along with the actual characters that comprise the pattern.

There is a limit to the amount of extra accuracy that training can achieve. OCR still does not perform well with more stylized typefaces, such as Gothic, that are significantly different from modern ones—and training may not help much.

Obviously, better results can be obtained if a language dictionary is used. It is far easier to distinguish letters like o, 0, O, and Q in the context of the words in which they occur. Most OCR systems include predefined dictionaries and are able to use domain-specific ones containing technical terms, common names, abbreviations, product codes, and the like. Particular words may be constrained to particular styles of capitalization. Regular words may appear with or without an initial capital letter and may also be written in all capitals. Proper names must begin with a capital letter (and may be written in all capitals too). Some acronyms are always capitalized, while others may be capitalized in fixed but arbitrary ways.
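A dictionary constraint of this kind might be sketched as follows. For an OCR'd word, variants are generated by substituting commonly confused glyphs, and only those found in the lexicon are kept; the confusion table and lexicon here are illustrative, not taken from any particular OCR engine:

```python
# Dictionary-constrained correction: try swapping confusable glyphs
# (o/0/O, l/1/I) and keep any variant that appears in the lexicon.
from itertools import product

CONFUSABLE = {"0": "0oO", "o": "o0O", "O": "O0o",
              "1": "1lI", "l": "l1I", "I": "I1l"}

def candidates(word, lexicon):
    """All lexicon words reachable by confusable-glyph substitution."""
    options = [CONFUSABLE.get(ch, ch) for ch in word]
    return [w for w in ("".join(p) for p in product(*options)) if w in lexicon]

lexicon = {"look", "cool"}
assert candidates("l00k", lexicon) == ["look"]
```

Real systems integrate such constraints into the recognition process itself, weighting character hypotheses by dictionary evidence rather than correcting words after the fact.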

Just as the language determines the basic alphabet, it may also preclude many letter combinations. Such information can greatly constrain the recognition process, and some OCR systems let users provide it.
