elements, so every piece of text is indexed multiple times. Depending on how deeply
the text is nested in the document, this may be slow and create a huge number of
index files.
So, the best strategy for full-text indexes is to define them as narrowly as you can.
And be careful using wildcards, because they can quickly get out of hand!
Handling Mixed Content
You can decide how to handle mixed content by using the inline and ignore elements. These elements can appear globally (as children of the lucene element) or per index (as children of the text element). inline also has an effect on how Lucene treats whitespace. They have the following format:
<inline qname = string />
<ignore qname = string />
qname holds the qualified name (with an optional namespace prefix) of the element to be treated as inline or to be ignored.
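As a rough illustration of the format above, the following fragment uses both elements inside a single index with namespace-prefixed names. It is a sketch under assumptions: the tei prefix, its namespace URI, and the element names p, hi, and note are hypothetical choices that do not come from the text. The comment on ignore reflects how the element is generally understood to behave in eXist-db's Lucene configuration, which this section does not spell out.

```xml
<!-- Sketch only: prefix, namespace URI, and element names are illustrative assumptions -->
<lucene xmlns:tei="http://www.tei-c.org/ns/1.0">
    <text qname="tei:p">
        <!-- tei:hi does not split tokens inside tei:p -->
        <inline qname="tei:hi"/>
        <!-- tei:note content is assumed to be left out of the tei:p index -->
        <ignore qname="tei:note"/>
    </text>
</lucene>
```

Because both elements are children of text here, they apply only to this one index and not to any other index defined under lucene.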
Inline content and whitespace
By default, Lucene treats inline elements as token separators, which may or may not
be what you want. For instance, assume we have an XML fragment like:
<p>This is <b>un</b>clear.</p>
Because of the b inline element, Lucene will see this as "This is un clear." (notice the space between un and clear), which is probably not what you intended! To address this, use an index definition like:
<lucene>
    <text qname="p">
        <inline qname="b"/>
    </text>
</lucene>
Or, if the b element is always an inline element in all other elements of the collection's documents, declare it globally:
<lucene>
    <text qname="p"/>
    <!-- other text indexes -->
    <inline qname="b"/>
</lucene>
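For context, the lucene element does not stand alone. Assuming this is an eXist-db collection configuration file (collection.xconf), it would normally be wrapped as in the sketch below. The wrapper elements and namespace follow eXist-db's usual configuration layout rather than anything shown in this section, and the second index on li is a hypothetical addition that shows the global inline applying to more than one index.

```xml
<!-- Sketch of a complete collection.xconf, assuming the eXist-db layout -->
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <lucene>
            <text qname="p"/>
            <!-- hypothetical second index, added for illustration -->
            <text qname="li"/>
            <!-- global: b is treated as inline in every index above -->
            <inline qname="b"/>
        </lucene>
    </index>
</collection>
```

Changes to index definitions generally take effect for existing documents only after the collection has been reindexed.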