Finding information with NoSQL search - Making Sense of NoSQL


DocBook is frequently customized for different types of publishing. Each organization

that's publishing a document will select a subset of DocBook elements and then add

their own elements to meet their specific application. For example, a math textbook

might include XML markup for equations (MathML), a chemistry textbook might

include markup for chemical symbols (Chemical Markup Language, CML), and an economics textbook might

add charts in XML format. These new XML vocabularies can be placed in separate

namespaces and added to DocBook XML without disrupting the publishing processes.

7.8.2 Retaining document structure in a NoSQL document store

There are several ways to search large collections of DocBook files. The

most straightforward is to strip out all the markup and send each document

to Apache Lucene to build an inverted index (sometimes called a reverse index).

Each word would then be associated only with the IDs of the documents that contain

it. The problem with this approach is that all information about where the word

occurs within the document is lost. If a word occurs in a topic or chapter title,

it can't be ranked higher than if the word occurs in a bibliographic note.
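
The loss of structure described above can be seen in a minimal sketch. It is plain Python rather than Lucene, and the sample documents, the tokenizer, and the names strip_markup and build_flat_index are all hypothetical, chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def strip_markup(xml_text):
    """Return only the text content of an XML document, discarding all tags."""
    root = ET.fromstring(xml_text)
    return " ".join(root.itertext())


def build_flat_index(docs):
    """Map each lowercased word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for word in re.findall(r"[a-z0-9]+", strip_markup(xml_text).lower()):
            index[word].add(doc_id)
    return index


# Hypothetical documents: "sharding" is a chapter title in doc1,
# but only appears in a bibliographic note in doc2.
docs = {
    "doc1": "<chapter><title>Sharding</title><para>Splitting data.</para></chapter>",
    "doc2": "<chapter><title>Indexes</title><bibliography>"
            "<bibliomixed>Notes on sharding.</bibliomixed></bibliography></chapter>",
}

flat_index = build_flat_index(docs)
print(sorted(flat_index["sharding"]))  # prints ['doc1', 'doc2']
```

Both documents come back as equal matches for "sharding": once the markup is gone, the index has no way to tell the title hit from the bibliography hit.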

Ideally, you want to retain the entire document structure and store the XML file in

a native XML database. Then any match within a title can have a higher rank than if

the match occurs within the body of the text.

The first step in creating a search function is to load all the XML documents into a

collection structure. This structure logically groups similar documents and makes it

easy to navigate the documents, similar to a file browser. After the documents have

been loaded, you can run a script to find all unique elements in the document collection.

This is known as an element inventory.
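
An element inventory script could look roughly like the following sketch. It assumes the documents are available as XML strings and uses Python's standard library instead of a native XML database; the function name element_inventory and the sample collection are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_inventory(xml_documents):
    """Count how often each element name occurs across a document collection."""
    counts = Counter()
    for xml_text in xml_documents:
        root = ET.fromstring(xml_text)
        for element in root.iter():  # the root element and all descendants
            # ElementTree writes namespaced tags as "{uri}name"; keep the local name.
            name = element.tag.rsplit("}", 1)[-1]
            counts[name] += 1
    return counts


# Hypothetical two-document collection.
collection = [
    "<book><title>NoSQL</title><chapter><title>Search</title>"
    "<para>Text.</para></chapter></book>",
    "<article><title>Notes</title><para>One.</para><para>Two.</para>"
    "<pubdate>2013-09-01</pubdate></article>",
]

inventory = element_inventory(collection)
for name, count in sorted(inventory.items()):
    print(name, count)
```

Dropping the namespace is a simplification. A collection that mixes DocBook with other vocabularies would probably want to keep the namespace in the inventory key so that elements with the same local name are not merged.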

The element inventory is then used as a basis for deciding what elements might

contain information that you want to index for quick searches, and what index types

you'll use. Elements that contain dates might use a range index and elements such as

<title> and <para> that contain full text might use a full-text index.
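
To illustrate why date elements suit a range index, here is a minimal sketch of one built on a sorted list. It is not how any particular XML database implements range indexes; the class name RangeIndex, its between method, and the sample publication dates are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right
from datetime import date


class RangeIndex:
    """Keep (value, doc_id) pairs sorted by value for fast range lookups."""

    def __init__(self, entries):
        self._entries = sorted(entries)
        self._keys = [value for value, _ in self._entries]

    def between(self, low, high):
        """Return the IDs of documents whose value lies in [low, high]."""
        start = bisect_left(self._keys, low)
        end = bisect_right(self._keys, high)
        return [doc_id for _, doc_id in self._entries[start:end]]


# Hypothetical publication dates taken from date elements in three documents.
pubdates = RangeIndex([
    (date(2013, 9, 1), "doc2"),
    (date(2011, 3, 15), "doc1"),
    (date(2014, 1, 20), "doc3"),
])

print(pubdates.between(date(2012, 1, 1), date(2013, 12, 31)))  # prints ['doc2']
```

Because the values are kept in order, a query such as "published between two dates" needs only two binary searches, which a word-oriented full-text index cannot offer.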

In addition to the index type, you can also rank the probability that any element

might be a good summary of the concepts in a section. We call this ranking process

setting the boost values for a document collection. For example, a match on the title

of a chapter will rank higher than a match on a section title or a glossary keyword.

After semantic weights have been created, a configuration file is created and the

indexing process begins. Table 7.2 shows an example of these boost values.

We should note that the boost values are also stored with the search result indexes

so that they can be used to create precise

Table 7.2 Example of boost values for a technical topic search site

Element                    Boost value
Book title                 5.0
Chapter title              4.0
Glossary term              3.0
Indexed term               2.0
Paragraph text             1.0
Bibliographic reference    0.5
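
As a rough illustration of how the boost values in Table 7.2 could affect ranking, the sketch below adds up the boost of every element whose text contains the search term. The mapping from table rows to DocBook element names, the additive scoring rule, and the names BOOSTS, boost_for, and score are all assumptions; a real search engine would combine boosts with its own relevance formula.

```python
import re
import xml.etree.ElementTree as ET

# Boost values from Table 7.2. The element names chosen for each row are assumed.
BOOSTS = {
    "book/title": 5.0,     # Book title
    "chapter/title": 4.0,  # Chapter title
    "glossterm": 3.0,      # Glossary term
    "primary": 2.0,        # Indexed term
    "para": 1.0,           # Paragraph text
    "bibliomixed": 0.5,    # Bibliographic reference
}


def boost_for(parent_tag, tag):
    """Prefer a parent-specific boost, then an element-wide one, else 0."""
    return BOOSTS.get(f"{parent_tag}/{tag}", BOOSTS.get(tag, 0.0))


def score(xml_text, term):
    """Sum the boosts of all elements whose own text contains the term."""
    root = ET.fromstring(xml_text)
    term = term.lower()
    total = 0.0
    for parent in root.iter():
        for child in parent:
            # Only the text directly inside the element is checked, so a hit
            # is never counted again for an enclosing element. Text that
            # follows an inline child element is ignored in this sketch.
            words = re.findall(r"[a-z0-9]+", (child.text or "").lower())
            if term in words:
                total += boost_for(parent.tag, child.tag)
    return total


doc_a = ("<book><title>NoSQL Search</title><chapter><title>Indexes</title>"
         "<para>Text.</para></chapter></book>")
doc_b = ("<book><title>Databases</title><chapter><title>Storage</title>"
         "<para>Full-text search basics.</para><bibliography>"
         "<bibliomixed>A search primer.</bibliomixed></bibliography></chapter></book>")

print(score(doc_a, "search"))  # prints 5.0
print(score(doc_b, "search"))  # prints 1.5
```

The first document mentions the term once, in its book title, and still outranks the second, which mentions it twice in a paragraph and a bibliographic reference.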
