DocBook is frequently customized for different types of publishing. Each publishing
organization typically selects a subset of DocBook elements and then adds its own
elements to meet its specific needs. For example, a math textbook
might include XML markup for equations (MathML), a chemistry textbook might
include markup for chemical symbols (ChemML), and an economics textbook might
add charts in XML format. These new XML vocabularies can be placed in their own
namespaces and added to DocBook XML without disrupting the publishing processes.
7.8.2 Retaining document structure in a NoSQL document store
There are several ways to search large collections of DocBook files. The
most straightforward is to strip out all the markup and send each document
to Apache Lucene to create an inverted index. Each word is then associated
only with the IDs of the documents that contain it. The problem with this approach
is that all information about where the word occurs within the document is lost.
If a word occurs in a topic or chapter title, it can't be ranked higher than if it
occurs in a bibliographic note.
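
The markup-stripping approach can be illustrated with a minimal sketch. It uses plain Python rather than Apache Lucene, and the two document fragments and their IDs are made up for illustration. It shows only why the word location is lost once the tags are gone.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two tiny hypothetical DocBook fragments, keyed by made-up document IDs.
DOCS = {
    "doc1": "<chapter><title>Indexing</title><para>Search basics.</para></chapter>",
    "doc2": "<chapter><title>Storage</title><para>Notes on indexing.</para></chapter>",
}

def strip_markup(xml_text):
    """Return only the text content of an XML string, discarding all tags."""
    root = ET.fromstring(xml_text)
    return " ".join(root.itertext())

def build_flat_index(docs):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for word in re.findall(r"[a-z0-9]+", strip_markup(xml_text).lower()):
            index[word].add(doc_id)
    return index

flat_index = build_flat_index(DOCS)
# Both documents match "indexing", but the index cannot tell that doc1 uses
# the word in a title while doc2 uses it only in a paragraph.
print(sorted(flat_index["indexing"]))
```

Because the index stores nothing but document IDs, a search for "indexing" has no basis for ranking doc1 above doc2.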
Ideally, you want to retain the entire document structure and store the XML file in
a native XML database. Then any match within a title can have a higher rank than if
the match occurs within the body of the text.
The first step in creating a search function is to load all the XML documents into a
collection structure. This structure logically groups similar documents and makes it
easy to navigate the documents, similar to a file browser. After the documents have
been loaded, you can run a script to find all unique elements in the document
collection. This is known as an element inventory.
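
An element inventory script could look roughly like the following sketch. In a native XML database this would normally be written as a query against the stored collection; here an in-memory dictionary with made-up paths and content stands in for the collection, which is an assumption for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stand-in for a loaded collection: path -> XML content.
COLLECTION = {
    "/db/books/ch1.xml": "<chapter><title>A</title><para>x</para><para>y</para></chapter>",
    "/db/books/gloss.xml": "<glossary><title>G</title><glossterm>z</glossterm></glossary>",
}

def element_inventory(collection):
    """Count how often each element name occurs across all documents."""
    counts = Counter()
    for xml_text in collection.values():
        # iter() walks the root element and every descendant element.
        # Namespaced elements would appear as "{namespace-uri}localname".
        for element in ET.fromstring(xml_text).iter():
            counts[element.tag] += 1
    return counts

inventory = element_inventory(COLLECTION)
for name, count in sorted(inventory.items()):
    print(f"{name}: {count}")
```

The resulting list of element names and counts is the raw material for deciding which elements deserve an index.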
The element inventory is then used as a basis for deciding which elements might
contain information that you want to index for quick searches, and which index types
you'll use. Elements that contain dates might use a range index, and elements such as
<title> and <para> that contain full text might use a full-text index.
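
One way to turn the inventory into an index plan is sketched below. The heuristic (all sampled values parse as ISO dates means a range index, anything else means a full-text index) and the sample values are assumptions for illustration; a real project would review each element by hand and use the index configuration format of its own database.

```python
from datetime import date

def looks_like_date(value):
    """Return True if value parses as an ISO 8601 calendar date (YYYY-MM-DD)."""
    try:
        date.fromisoformat(value.strip())
        return True
    except ValueError:
        return False

def suggest_index_type(sample_values):
    """Suggest 'range' when every sample is a date, otherwise 'full-text'."""
    if sample_values and all(looks_like_date(v) for v in sample_values):
        return "range"
    return "full-text"

# Hypothetical sample values collected per element during the inventory.
SAMPLES = {
    "pubdate": ["2013-09-01", "2012-01-15"],
    "title": ["Retaining document structure"],
    "para": ["There are several ways to perform search."],
}

index_plan = {name: suggest_index_type(values) for name, values in SAMPLES.items()}
print(index_plan)
```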
In addition to the index type, you can also rank the probability that any element
might be a good summary of the concepts in a section. We call this ranking process
setting the boost values for a document collection. For example, a match on the title of
a chapter will rank higher than a section title or a glossary keyword. After semantic
weights have been created, a configuration file is created and the indexing process
begins. Table 7.2 shows an example of these boost values.
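
The effect of boost values on ranking can be sketched as follows. The numbers are the ones listed in Table 7.2, but the context labels, the document IDs, and the scoring rule (a plain sum of boosts per match) are simplifying assumptions; real search engines combine boosts with term statistics.

```python
# Boost values taken from Table 7.2; the context keys are hypothetical labels.
BOOSTS = {
    "book-title": 5.0,
    "chapter-title": 4.0,
    "glossary-term": 3.0,
    "indexed-term": 2.0,
    "paragraph": 1.0,
    "bibliographic-reference": 0.5,
}

def score(hits, boosts=BOOSTS, default=1.0):
    """Sum the boost of every context in which the search term matched."""
    return sum(boosts.get(context, default) for context in hits)

# Hypothetical search results: document ID -> contexts where the term matched.
RESULTS = {
    "doc-a": ["paragraph", "paragraph", "bibliographic-reference"],
    "doc-b": ["chapter-title"],
}

# Highest score first.
ranked = sorted(RESULTS, key=lambda doc_id: score(RESULTS[doc_id]), reverse=True)
print(ranked)
```

Here a single chapter-title match (4.0) outranks two paragraph matches plus a bibliographic reference (2.5), which is the behavior that a flat word-to-document index cannot provide.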
Table 7.2 Example of boost values for a technical topic search site

Element                    Boost value
Book title                 5.0
Chapter title              4.0
Glossary term              3.0
Indexed term               2.0
Paragraph text             1.0
Bibliographic reference    0.5

We should note that the boost values are also stored with the search result indexes so
that they can be used to create precise
 