characteristic metadata of the specific documents. In the first release, the aim of the
gateway was to collect a general set of metadata of all documents (document type, a
characteristic date, company, and person). All of these modules are time-consuming
and computationally intensive, requiring DCI (cloud) resources as well as a general
interface to the DCI systems. WS-PGRADE, the DCI-Bridge and gUSE have
proven to be good candidates for this.
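The general metadata set named above (document type, a characteristic date, company, and person) could be held in a small record like the one below. This is only an illustrative sketch: the class name `GeneralMetadata`, the field names, and the `missing_fields` helper are assumptions made for this example and are not taken from the eDOX Archiver Gateway itself.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import List, Optional


@dataclass
class GeneralMetadata:
    """Hypothetical record for the general metadata collected for every document."""

    document_type: Optional[str] = None
    characteristic_date: Optional[date] = None
    company: Optional[str] = None
    person: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Return the names of fields that extraction has not filled in yet."""
        return [name for name, value in asdict(self).items() if value is None]


# Example: a record where only two of the four fields were extracted.
record = GeneralMetadata(document_type="invoice", company="ACME Ltd.")
print(record.missing_fields())
```

Keeping every field optional reflects that an extraction module may fail to find a value in a given document.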
19.3.1 Role of WS-PGRADE/gUSE
WS-PGRADE/gUSE is a complex system for supporting workflows that use grid or
cloud resources in the background. In the case of the eDOX Archiver Gateway such
functionality is desired. The eDOX Gateway can access the gUSE system via the
Remote API component (REST API), providing access to WS-PGRADE/gUSE
workflows. However, the biggest advantage in this case is provided by the DCI
Bridge component. The DCI-Bridge gives a common interface for several grid and
cloud systems, making job submission to any of the supported DCIs straightforward.
The DCI Bridge, together with its CloudBroker plugin, also acts as a resource
broker. The eDOX Archiver Gateway is a commercial product (and its developer is
an SME), which means that, beyond ensuring the expected level of quality, cost-
effectiveness is an important requirement. The management interface of the DCI
Bridge enables setting configurations that support this objective.
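To make the cost-effectiveness requirement concrete, the sketch below shows one simple way a broker could choose among resources: keep only those that meet a deadline, then take the cheapest. It is not the DCI Bridge or CloudBroker selection logic, which the text does not describe. The `Resource` type, its fields, and the prices are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Resource:
    """Hypothetical description of a cloud resource offered to the broker."""

    name: str
    price_per_hour: float  # currency units per hour of use
    relative_speed: float  # 1.0 means the reference machine


def estimated_runtime(resource: Resource, reference_hours: float) -> float:
    """Runtime in hours, scaled from the runtime on the reference machine."""
    return reference_hours / resource.relative_speed


def estimated_cost(resource: Resource, reference_hours: float) -> float:
    """Total cost of running the job on the given resource."""
    return estimated_runtime(resource, reference_hours) * resource.price_per_hour


def pick_resource(
    resources: Iterable[Resource], reference_hours: float, deadline_hours: float
) -> Optional[Resource]:
    """Return the cheapest resource that finishes within the deadline, if any."""
    feasible = [
        r for r in resources
        if estimated_runtime(r, reference_hours) <= deadline_hours
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda r: estimated_cost(r, reference_hours))


# Example with made-up offers: a 2-hour reference job and a 3-hour deadline.
offers = [
    Resource("small", price_per_hour=0.10, relative_speed=0.5),
    Resource("medium", price_per_hour=0.25, relative_speed=1.0),
    Resource("large", price_per_hour=0.80, relative_speed=2.0),
]
choice = pick_resource(offers, reference_hours=2.0, deadline_hours=3.0)
print(choice.name if choice else "no feasible resource")
```

Filtering on the deadline first and minimizing cost second keeps the quality constraint hard and treats price as the quantity to optimize, which matches the priorities stated in the text.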
19.3.2 Architecture of the eDOX Gateway
The eDOX Archiver Gateway supports the digitization and processing of paper-
based documents. The architecture of the gateway is illustrated in Fig. 19.3.
The eDOX Archiver Gateway document management system (Gateway functional
layer) uses a database (database layer) to store contents and metadata of
documents and workflows (document layer). The cloud management system
(interface layer) is provided by SZTAKI to access the chosen cloud service (cloud
functional layer) via the CloudBroker plugin.
The input of the workflow is the scanned, digitized document (either in PDF or
multi-paged TIFF format). This image content is uploaded (via web form, FTP/SCP
etc.) to the portal, and the server side starts processing. The OCR module covers the
following functionalities:
• OCRing: recognition of characters (e.g., Asian font sets), and keeping layout
layer (e.g., creating XPath (XML Path Language) based rules for stylesheets,
• Spellchecking: post-processing of OCR-enabled content; approximately 200
languages are supported currently.
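Since the workflow accepts only PDF or multi-page TIFF input, the server side could check the upload before any OCR work starts. The sketch below does this with the well-known file signatures (`%PDF-` for PDF; `II*\0` or `MM\0*` for classic TIFF) and then chains the OCR and spellchecking steps. The function names and the `ocr` and `spellcheck` callables are hypothetical stand-ins, not the gateway's actual modules, and BigTIFF files are not covered.

```python
from typing import Callable

PDF_MAGIC = b"%PDF-"
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")  # little-endian and big-endian TIFF


def detect_format(header: bytes) -> str:
    """Classify an upload as 'pdf' or 'tiff' from its first bytes."""
    if header.startswith(PDF_MAGIC):
        return "pdf"
    if header[:4] in TIFF_MAGICS:
        return "tiff"
    raise ValueError("unsupported input: expected PDF or TIFF")


def process_upload(
    data: bytes,
    ocr: Callable[[bytes, str], str],
    spellcheck: Callable[[str], str],
) -> str:
    """Validate the upload, then run OCR followed by spellchecking."""
    fmt = detect_format(data[:8])
    raw_text = ocr(data, fmt)
    return spellcheck(raw_text)


# Example with trivial stand-ins for the two processing steps.
fake_ocr = lambda data, fmt: "recieved (" + fmt + ")"
fake_spellcheck = lambda text: text.replace("recieved", "received")
print(process_upload(b"%PDF-1.7\n", fake_ocr, fake_spellcheck))
```

Rejecting unsupported files up front avoids spending DCI resources on input that the OCR step cannot handle.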