After reading this case study, you'll understand how annotations are used to solve
business problems and how native XML databases are unique in their ability to query
text with rich annotations. You'll also become familiar with how open source native
XML databases use XQuery and Lucene full-text search library functions to create
high-quality search tools.

The Office of the Historian at the Department of State is charged by statute with
publishing the official records associated with US foreign relations. A declassified
analysis of specific periods of US diplomatic history is published in a series of volumes
titled Foreign Relations of the United States (FRUS). Through a detailed editing and
peer review process, the Office of the Historian has become the “gold standard” for
accuracy in the history of international diplomacy. FRUS documents are used in
political science and diplomacy classes as well as for other training throughout the world.

In 2008, the Office of the Historian embarked on an initiative to convert the
printed FRUS textbooks into an online format that could be easily searched and
viewed in multiple formats. The Office of the Historian chose the Text Encoding
Initiative (TEI), a standard XML format widely used for encoding historical documents.
TEI was chosen because it has precise XML elements to encode a digital representation
of historical documents and includes elements for indicating the people, organizations,
locations, dates, and terms used in the documents.

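To make the idea of annotation elements concrete, the following is a minimal Python sketch that pulls people, organizations, places, and dates out of a small TEI fragment. The fragment is invented for illustration and isn't taken from a FRUS volume; the element names (persName, orgName, placeName, date) are standard TEI elements, but whether the FRUS encoding uses exactly this set is an assumption.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; elements must be looked up with it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment, invented for illustration (not from a FRUS volume).
SAMPLE = """
<p xmlns="http://www.tei-c.org/ns/1.0">
  On <date when="1969-01-20">January 20, 1969</date>,
  <persName>John Doe</persName> of the
  <orgName>Example Department</orgName> sent a telegram from
  <placeName>Exampleville</placeName>.
</p>
"""

def extract_annotations(xml_text):
    """Return a dict mapping each annotation element name to the text it wraps."""
    root = ET.fromstring(xml_text)
    found = {}
    for tag in ("persName", "orgName", "placeName", "date"):
        found[tag] = [el.text for el in root.findall(f".//tei:{tag}", TEI_NS)]
    return found

annotations = extract_annotations(SAMPLE)
for tag, values in annotations.items():
    print(tag, values)
```

Because the annotations are ordinary XML elements, a query can ask for "every person mentioned" rather than matching a name as plain text, which is the kind of query the rest of this case study relies on.
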
To convert the FRUS volumes (each over 1,000 pages long) to TEI format, the
documents are first sent to an outside service that enters the information into two
separate XML documents using an XML editor. The two XML files are compared against
each other to ensure accuracy. The TEI-encoded XML documents are then returned
to the Office of the Historian ready to be indexed and transformed into HTML, PDF,
or other formats. Figure 5.9 outlines this encoding process.

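The comparison of the two separately entered files can be pictured with a short Python sketch. The text doesn't say which tool the outside service uses, so this is only one possible approach under stated assumptions: each file is flattened into a list of (tag, text, tail) entries in document order, and the two lists are compared position by position. The sample strings and function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest

def flatten(xml_text):
    """List (tag, text, tail) for every element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        (el.tag, (el.text or "").strip(), (el.tail or "").strip())
        for el in root.iter()
    ]

def find_discrepancies(first_keying, second_keying):
    """Return (position, entry_a, entry_b) for every position that differs.

    A missing entry on either side shows up as None.
    """
    a, b = flatten(first_keying), flatten(second_keying)
    return [
        (i, x, y)
        for i, (x, y) in enumerate(zip_longest(a, b))
        if x != y
    ]

# Hypothetical pair of keyings; the second contains a typing error.
KEYING_A = "<p>Telegram from <placeName>Exampleville</placeName> received.</p>"
KEYING_B = "<p>Telegram from <placeName>Exampelville</placeName> received.</p>"

for position, entry_a, entry_b in find_discrepancies(KEYING_A, KEYING_B):
    print(position, entry_a, entry_b)
```

The reasoning behind double entry is that two operators are unlikely to make the same mistake in the same place, so any position where the files disagree is a candidate error for a human to review.
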
[Figure 5.9 diagram. Labels: printed documents; encoded in TEI XML format with
annotations; validation with XML Schema and Schematron; Subversion; eXist DB
(Lucene full-text and B+tree indexes); XQuery search service; HTML search forms.]
Figure 5.9
The overall document workflow for converting printed historical
documents into an online system using TEI encoding. TEI-encoded documents are
validated using XML schemas and Schematron rules files and saved into a
Subversion revision control system. XML documents are then loaded into the
eXist native XML database. Search forms are used to send keyword queries to a
REST XQuery search service. This service uses the eXist document tree indexes
and Lucene indexes to create search results.
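As a rough illustration of the last step in the caption, the Python sketch below builds the URL a search form might send to an eXist REST endpoint. It assumes eXist's REST interface with its `_query` and `_howmany` parameters and the eXist Lucene function `ft:query`; the server address, the `/db/frus` collection path, the choice of `tei:div` as the indexed element, and the function names are all hypothetical and aren't taken from the Office of the Historian's actual service.

```python
from urllib.parse import urlencode

# Hypothetical server and collection path, used only for illustration.
BASE_URL = "http://localhost:8080/exist/rest/db/frus"

def xquery_string_literal(text):
    """Quote text as an XQuery string literal, escaping & and double quotes."""
    escaped = text.replace("&", "&amp;").replace('"', '""')
    return f'"{escaped}"'

def build_search_url(keyword, how_many=10):
    """Build a REST URL that runs a full-text XQuery for the given keyword.

    Assumes a Lucene index is configured on tei:div and that the ft prefix
    is available to the query; both are assumptions about the deployment.
    """
    xquery = (
        'declare namespace tei="http://www.tei-c.org/ns/1.0"; '
        f"//tei:div[ft:query(., {xquery_string_literal(keyword)})]"
    )
    params = {"_query": xquery, "_howmany": how_many}
    return f"{BASE_URL}?{urlencode(params)}"

print(build_search_url("Vietnam"))
```

The keyword is escaped before it's placed inside the XQuery string so that quotation marks typed into a search form can't change the structure of the query.
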