After reading this case study, you'll understand how annotations are used to solve
business problems and how native XML databases are unique in their ability to query
text with rich annotations. You'll also become familiar with how open source native
XML databases use XQuery and Lucene full-text search library functions to create
high-quality search tools.
The Office of the Historian at the Department of State is charged by statute with
publishing the official records associated with US foreign relations. A declassified
analysis of specific periods of US diplomatic history is published in a series of volumes
titled Foreign Relations of the United States (FRUS). Through a detailed editing and peer
review process, the Office of the Historian has become the “gold standard” for accuracy
in the history of international diplomacy. FRUS documents are used in political
science and diplomacy classes as well as for other training throughout the world.
In 2008, the Office of the Historian embarked on an initiative to convert the
printed FRUS textbooks into an online format that could be easily searched and
viewed in multiple formats. The Office of the Historian chose the Text Encoding
Initiative (TEI), a standard XML format widely used for encoding historical documents.
TEI was chosen because it has precise XML elements to encode a digital representation
of historical documents and includes elements for indicating the people, organizations,
locations, dates, and terms used in the documents.
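To make the idea of annotation elements concrete, the following is a minimal sketch in Python that pulls people, organizations, places, and dates out of a small TEI fragment. The fragment, the names in it, and the function extract_annotations are invented for illustration and are not taken from FRUS; only the TEI namespace and the element names persName, orgName, placeName, and date (with its when attribute) come from the TEI vocabulary.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical TEI fragment, invented for illustration (not a FRUS document).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>On <date when="1961-04-20">April 20, 1961</date>,
       <persName>Jane Doe</persName> of the
       <orgName>Example Embassy</orgName> sent a telegram to
       <placeName>Paris</placeName>.</p>
  </body></text>
</TEI>"""


def extract_annotations(tei_xml):
    """Return a dict mapping annotation type to the values found in the text."""
    root = ET.fromstring(tei_xml)
    found = {}
    for tag in ("persName", "orgName", "placeName"):
        # Collect the full text content of every matching element.
        found[tag] = [
            "".join(el.itertext()).strip()
            for el in root.iterfind(f".//tei:{tag}", TEI_NS)
        ]
    # For dates, prefer the machine-readable @when attribute.
    found["date"] = [el.get("when") for el in root.iterfind(".//tei:date", TEI_NS)]
    return found


annotations = extract_annotations(SAMPLE)
```

Because the annotations are ordinary XML elements, the same lookups can be expressed as path expressions in XQuery inside a native XML database.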
To convert the FRUS volumes (each over 1,000 pages long) to TEI format, the documents
are first sent to an outside service that enters the information into two separate
XML documents using an XML editor. The two XML files are compared against
each other to ensure accuracy. The TEI-encoded XML documents are then returned
to the Office of the Historian ready to be indexed and transformed into HTML, PDF,
or other formats. Figure 5.9 outlines this encoding process.
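The text does not say how the two keyed files are compared. As one possible approach, the sketch below walks two XML trees in parallel and reports every place where the tags, attributes, or whitespace-normalized text disagree. The function names and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def normalize(text):
    """Collapse runs of whitespace so line wrapping does not count as a difference."""
    return " ".join((text or "").split())


def diff_elements(a, b, path="/"):
    """Recursively compare two elements and return a list of difference messages."""
    if a.tag != b.tag:
        return [f"{path}: tag {a.tag!r} != {b.tag!r}"]
    here = f"{path}{a.tag}"
    diffs = []
    if a.attrib != b.attrib:
        diffs.append(f"{here}: attributes {a.attrib!r} != {b.attrib!r}")
    if normalize(a.text) != normalize(b.text):
        diffs.append(f"{here}: text {normalize(a.text)!r} != {normalize(b.text)!r}")
    if normalize(a.tail) != normalize(b.tail):
        diffs.append(f"{here}: tail {normalize(a.tail)!r} != {normalize(b.tail)!r}")
    if len(a) != len(b):
        diffs.append(f"{here}: {len(a)} children vs {len(b)}")
    # Compare children pairwise; extra children were already reported above.
    for child_a, child_b in zip(a, b):
        diffs.extend(diff_elements(child_a, child_b, here + "/"))
    return diffs


def compare_keyings(xml_a, xml_b):
    """Compare two independently keyed versions of the same document."""
    return diff_elements(ET.fromstring(xml_a), ET.fromstring(xml_b))


# Hypothetical sample: the second keying contains a typing error.
first = "<p>Sent to <placeName>Paris</placeName> today.</p>"
second = "<p>Sent to <placeName>Pairs</placeName> today.</p>"
differences = compare_keyings(first, second)
```

An empty result means the two keyings agree; each message otherwise points an editor at the element that needs to be checked against the printed page.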
[Figure 5.9 diagram labels: Printed documents; Encoded in TEI XML format with annotations; Validation with XML Schema and Schematron; Subversion; eXist DB (Lucene full-text and B+tree indexes); XQuery search service; HTML search forms]
Figure 5.9 The overall document workflow for converting printed historical
documents into an online system using TEI encoding. TEI-encoded documents are
validated using XML schemas and Schematron rules files and saved into a
Subversion revision control system. XML documents are then loaded into the
eXist native XML database. Search forms are used to send keyword queries to a
REST XQuery search service. This service uses the eXist document tree indexes
and Lucene indexes to create search results.
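The last step of this workflow, a search form sending a keyword to an XQuery service, can be sketched as follows. This is only an illustration under stated assumptions: the host, the collection path /db/frus, and the choice to search TEI paragraph elements are hypothetical, and the query only returns hits if a Lucene index has been configured on those elements. The sketch relies on eXist's ft:query function and on the _query and _howmany parameters of its REST interface. It builds the request URL but does not send it.

```python
from urllib.parse import urlencode


def xquery_string(value):
    """Quote a Python string as an XQuery string literal."""
    # In XQuery literals, '&' starts an entity reference and '"' is escaped by doubling.
    return '"' + value.replace("&", "&amp;").replace('"', '""') + '"'


def build_search_xquery(keyword, collection="/db/frus"):
    """Build an XQuery that runs a Lucene full-text search over TEI paragraphs.

    The collection path is a hypothetical example.
    """
    return (
        'declare namespace tei="http://www.tei-c.org/ns/1.0"; '
        f"collection({xquery_string(collection)})"
        f"//tei:p[ft:query(., {xquery_string(keyword)})]"
    )


def build_search_url(base_url, keyword, how_many=10):
    """Build a GET URL for the REST interface; base_url is an assumption."""
    params = {"_query": build_search_xquery(keyword), "_howmany": str(how_many)}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical local server and collection.
url = build_search_url("http://localhost:8080/exist/rest/db/frus", 'arms "control"')
```

The keyword is passed through unchanged as Lucene query syntax, so a production service would normally validate it and wrap the query in a stored XQuery module rather than accept ad hoc queries from a form.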
 