7.8 Case study: searching technical documentation
This case study looks at the problem of searching technical documents.
High-quality search for technical documentation saves you time when you're
looking for information. For example, if you're using a complex software
package and need help with a specific function, an accurate search can
quickly get you to the right feature.
As you'll see, retaining document structure creates search systems with higher
precision and recall. In the following example, we'll use a specific XML file
format called DocBook, which is ideal for search and retrieval of technical
information. You'll see how Apache Lucene can be integrated directly into a
NoSQL database to create high-quality search. Note that the concepts used in
this section are general and can be applied to formats other than DocBook.
7.8.1 What is technical document search?
Technical document search focuses on helping you quickly find a specific area
of interest in technical documents. For example, you might be looking for a
how-to tip in a software user's guide, a diagram in a car repair manual, or a
particular topic in an online help system or a college textbook. Technical
publications use a process called single-source publishing, in which every
output format, such as web, online help, print, or EPUB, is derived from the
same source document. Figure 7.7 shows an example of how the DocBook XML
format stores technical documentation.
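As a rough illustration of single-source publishing, the Python sketch below renders one small DocBook-style section into two output formats. The function names to_html and to_plain_text and the sample section are assumptions made for this example; real DocBook publishing toolchains are considerably more elaborate.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single source: one DocBook-style section.
SOURCE = (
    "<sect1>"
    "<title>Returning search hits</title>"
    "<para>A KWIC function highlights keywords.</para>"
    "</sect1>"
)


def paragraphs(section):
    """Return the full text of each <para>, including nested inline elements."""
    return ["".join(p.itertext()) for p in section.iter("para")]


def to_html(section):
    """Render the section as a minimal HTML fragment."""
    title = section.findtext("title")
    body = "".join(f"<p>{escape(p)}</p>" for p in paragraphs(section))
    return f"<h2>{escape(title)}</h2>{body}"


def to_plain_text(section):
    """Render the same section as plain text with an upper-case heading."""
    title = section.findtext("title")
    return "\n\n".join([title.upper()] + paragraphs(section))


section = ET.fromstring(SOURCE)
html_output = to_html(section)
text_output = to_plain_text(section)
print(html_output)
print(text_output)
```

Both outputs come from the same parsed source, so a correction made once in the XML shows up in every format.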
DocBook is an XML standard specifically targeting technical publishing. DocBook
defines over 600 elements that are used to store the content of a technical
publication, including information about authors, revisions, sections,
paragraph text, figures, captions, tables, glossary tags, and bibliographic
information.
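To make the idea of structured elements concrete, the following Python sketch parses a small DocBook fragment (the one used in figure 7.7) with the standard library and pulls out content by element name. It is only an illustration of how element names expose document structure to a program. The variable names are assumptions for this example, not part of DocBook or of any search product.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Small DocBook fragment, based on the sample in figure 7.7.
DOCBOOK_SAMPLE = """
<book>
  <title>Making sense of NoSQL</title>
  <chapter>
    <title>Finding information with NoSQL search</title>
    <sect1>
      <title>Returning search hits</title>
      <para>A <glossterm>Key Word In Context</glossterm> (KWIC) function
      can be used to highlight the keywords in the search hit.</para>
    </sect1>
  </chapter>
</book>
"""

root = ET.fromstring(DOCBOOK_SAMPLE)

# Count how often each element name occurs in the document.
element_counts = Counter(elem.tag for elem in root.iter())

# Collect the text of every <title> and <glossterm>, in document order.
titles = [t.text for t in root.iter("title")]
glossary_terms = [g.text for g in root.iter("glossterm")]

print(element_counts)
print(titles)
print(glossary_terms)
```

Because titles and glossary terms are addressable by name, a search system can treat them differently from ordinary paragraph text.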
A hit in a topic title has a high search rank score.
Hits in glossary terms may get a higher boost value.
<book>
  <title>Making sense of NoSQL</title>
  <chapter>
    <title>Finding information with NoSQL search</title>
    <sect1>
      <title>Returning search hits</title>
      <para>A <glossterm>Key Word In Context</glossterm> (KWIC) function
      can be used to highlight the keywords in the search hit.</para>
    </sect1>
  </chapter>
</book>
A hit in a paragraph has a lower score.
Figure 7.7 A sample of a DocBook XML file. The <title> directly under the
<book> element is the title of the topic. A keyword hit within a topic title has a
higher score than a hit within the body text of the topic.
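The scoring idea in figure 7.7 can be sketched in a few lines of Python. This is a simplified illustration only, not Lucene's actual ranking formula: the boost values in BOOSTS, the score function, and the rule of attributing each keyword hit to its nearest enclosing element are all assumptions chosen for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical boost values: title hits count most, paragraph hits least.
BOOSTS = {"title": 3.0, "glossterm": 2.0, "para": 1.0}

SAMPLE = """
<book>
  <title>Making sense of NoSQL</title>
  <chapter>
    <title>Finding information with NoSQL search</title>
    <sect1>
      <title>Returning search hits</title>
      <para>A <glossterm>Key Word In Context</glossterm> (KWIC) function
      can be used to highlight the keywords in the search hit.</para>
    </sect1>
  </chapter>
</book>
"""


def own_text(elem):
    """Text that belongs directly to elem, excluding text inside child elements."""
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    return " ".join(parts)


def score(xml_text, keyword):
    """Sum boosted whole-word keyword hits over all boosted elements."""
    pattern = re.compile(r"\b" + re.escape(keyword.lower()) + r"\b")
    total = 0.0
    for elem in ET.fromstring(xml_text).iter():
        boost = BOOSTS.get(elem.tag)
        if boost is None:
            continue  # elements without a boost contribute nothing
        hits = len(pattern.findall(own_text(elem).lower()))
        total += boost * hits
    return total


print(score(SAMPLE, "search"))    # two title hits and one paragraph hit
print(score(SAMPLE, "context"))   # one glossary-term hit
print(score(SAMPLE, "keywords"))  # one paragraph hit
```

A flat full-text index would weight all of these hits equally; keeping the element names is what allows the title and glossary hits to rank higher.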