By default, Sphinx ranks matches using phrase proximity first and the classic BM25
weight second. This means that verbatim query quotes are guaranteed to be at the very
top, quotes that are off by a single word will be right below those, and so on.
When and how does phrase proximity affect results? Consider searching 1,000,000
pages of text for the phrase “To be or not to be.” Sphinx will put the pages with verbatim
quotes at the very top of the search results, whereas BM25-based systems will first
return the pages with the most mentions of “to,” “be,” “or,” and “not”—pages with
an exact quote match but only a few instances of “to” will be buried deep in the results.
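
To make the contrast concrete, here is a minimal Python sketch of the two orderings. It is not Sphinx's implementation: the proximity measure is a simplified stand-in (the longest run of query words that appears contiguously and in order in the document), the bag-of-words weight is a plain keyword count rather than real BM25, and all function names and sample pages are invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def phrase_proximity(query_tokens, doc_tokens):
    """Longest run of query words found contiguously, in query order,
    in the document (a word-level longest common substring).
    Simplified stand-in for a phrase proximity factor."""
    best = 0
    prev = [0] * (len(doc_tokens) + 1)
    for q in query_tokens:
        cur = [0] * (len(doc_tokens) + 1)
        for j, d in enumerate(doc_tokens, start=1):
            if q == d:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def keyword_count(query_tokens, doc_tokens):
    """Total occurrences of query words in the document.
    Assumption: a crude substitute for BM25, not the real formula."""
    wanted = set(query_tokens)
    return sum(1 for t in doc_tokens if t in wanted)

def rank(query, docs, use_proximity=True):
    """Return document IDs ordered by (proximity, count) or by count only."""
    q = tokenize(query)
    scored = []
    for doc_id, text in docs.items():
        d = tokenize(text)
        prox = phrase_proximity(q, d) if use_proximity else 0
        scored.append((-prox, -keyword_count(q, d), doc_id))
    scored.sort()  # highest proximity first, then highest count, then ID
    return [doc_id for _, _, doc_id in scored]

# Invented sample pages.
pages = {
    "exact": "Hamlet asks: to be or not to be, that is the question.",
    "near": "To be or maybe not to be.",
    "stuffed": "be not to or be not to or be not",
}

with_proximity = rank("To be or not to be", pages)
count_only = rank("To be or not to be", pages, use_proximity=False)
print(with_proximity)
print(count_only)
```

With proximity enabled, the page containing the verbatim quote sorts first; with the keyword count alone, the keyword-stuffed page wins because it simply mentions the query words more often.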
Most major web search engines today rank results with keyword positions as well.
Searching for a phrase on Google will likely result in pages with perfect or near-perfect
phrase matches appearing at the very top of the search results, followed by the “bag of
words” documents.
However, analyzing keyword positions requires additional CPU time, and sometimes
you might need to skip it for performance reasons. There are also cases when phrase
ranking produces undesired, unexpected results. For example, searching for tags in a
tag cloud is better without keyword positions: it makes no difference whether the tags from
the query are next to each other in the document.
To allow for flexibility, Sphinx offers a choice of ranking modes. Besides the default
mode of proximity plus BM25, you can choose from a number of others that include
BM25-only weighting, fully disabled weighting (which provides a nice optimization if
you're not sorting by rank), and more.
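
The idea of selectable ranking modes can be sketched in a few lines of Python. The mode names, the per-match statistics, and the `order_matches` helper below are all hypothetical and only mirror the three modes named above; they are not Sphinx's API. The sketch assumes that the disabled mode gives every match the same constant weight, so no per-match scoring work is needed.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-match statistics: {"id": ..., "proximity": ..., "bm25": ...}
Match = Dict[str, int]

# Hypothetical mode names, each mapped to a function that returns a sortable weight.
RANKERS: Dict[str, Callable[[Match], Tuple[int, ...]]] = {
    "proximity_bm25": lambda m: (m["proximity"], m["bm25"]),  # default: proximity, then BM25
    "bm25": lambda m: (m["bm25"],),                           # ignore keyword positions
    "none": lambda m: (1,),                                   # constant weight, no scoring
}

def order_matches(matches: List[Match], mode: str = "proximity_bm25") -> List[int]:
    """Return match IDs ordered by the chosen mode, best first.
    The sort is stable, so equal weights keep their original order."""
    weigh = RANKERS[mode]
    return [m["id"] for m in sorted(matches, key=weigh, reverse=True)]

# Invented sample data.
matches = [
    {"id": 1, "proximity": 2, "bm25": 900},
    {"id": 2, "proximity": 6, "bm25": 400},
    {"id": 3, "proximity": 3, "bm25": 400},
]

default_order = order_matches(matches)
bm25_order = order_matches(matches, mode="bm25")
unranked_order = order_matches(matches, mode="none")
print(default_order, bm25_order, unranked_order)
```

In the disabled mode every match compares equal, so the result keeps whatever order the matches arrived in, which is all you need when you sort by something other than rank.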
Support for Attributes
Each document might contain an unlimited number of numeric attributes. Attributes
are user-specified and can contain any additional information required for a specific
task. Examples include a blog post's author ID, an inventory item's price, a category
ID, and so on.
Attributes enable efficient full-text searches with additional filtering, sorting, and
grouping of the search results. In theory, they could be stored in MySQL and pulled
from there every time a search is performed. But in practice, if a full-text search locates
even hundreds or thousands of rows (which is not many), retrieving them from MySQL
is unacceptably slow.
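
As a rough illustration of what attributes buy you, the following Python sketch filters, sorts, and groups a set of matched document IDs using an in-memory attribute table, with no round trip to the database. The table contents, the attribute names, and the `filter_sort_group` helper are invented for the example and do not reflect how Sphinx lays out its data.

```python
from collections import Counter

# Hypothetical in-memory attribute table: document ID -> numeric attributes.
ATTRS = {
    101: {"author_id": 7, "price": 120},
    102: {"author_id": 7, "price": 80},
    103: {"author_id": 9, "price": 300},
    104: {"author_id": 9, "price": 45},
}

def filter_sort_group(matched_ids, max_price):
    """Filter full-text matches by price, sort the survivors by price,
    and count the surviving matches per author."""
    kept = [d for d in matched_ids if ATTRS[d]["price"] <= max_price]
    by_price = sorted(kept, key=lambda d: ATTRS[d]["price"])
    per_author = Counter(ATTRS[d]["author_id"] for d in kept)
    return by_price, dict(per_author)

# Pretend the full-text search matched these four documents.
ids, groups = filter_sort_group([101, 102, 103, 104], max_price=150)
print(ids)
print(groups)
```

Every step here is a dictionary lookup in memory, which is the property that makes attribute-based filtering cheap compared with fetching each matched row from MySQL.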
Sphinx supports two ways to store attributes: inline in the document lists or externally
in a separate file. Inlining requires all attribute values to be stored in the index many
times, once for each time a document ID is stored. This inflates the index size and
increases I/O, but reduces use of RAM. Storing the attributes externally requires
preloading them into RAM upon searchd startup.
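
The trade-off between the two layouts can be seen in a toy inverted index. This Python sketch is only a model under simplifying assumptions: real document lists are compact binary structures, and the function names and sample documents here are made up. It shows why inlining repeats the attributes once per stored document ID, whereas external storage keeps one copy per document in a separate table.

```python
from collections import defaultdict

# Invented sample documents with numeric attributes.
docs = {
    1: {"text": "red apple pie", "attrs": {"author_id": 7, "price": 120}},
    2: {"text": "green apple", "attrs": {"author_id": 7, "price": 80}},
    3: {"text": "red wine", "attrs": {"author_id": 9, "price": 300}},
}

def build_inline(docs):
    """Each posting carries its own copy of the document's attributes."""
    index = defaultdict(list)
    for doc_id, doc in docs.items():
        for word in set(doc["text"].split()):
            index[word].append((doc_id, dict(doc["attrs"])))
    return index

def build_external(docs):
    """Postings hold only document IDs; attributes live in one table."""
    index = defaultdict(list)
    attrs = {}
    for doc_id, doc in docs.items():
        attrs[doc_id] = dict(doc["attrs"])
        for word in set(doc["text"].split()):
            index[word].append(doc_id)
    return index, attrs

inline_index = build_inline(docs)
external_index, attr_table = build_external(docs)

# Count how many attribute copies each layout stores.
inline_copies = sum(len(postings) for postings in inline_index.values())
external_copies = len(attr_table)

# The same filtered search against both layouts.
cheap_inline = [d for d, a in inline_index["apple"] if a["price"] < 100]
cheap_external = [d for d in external_index["apple"] if attr_table[d]["price"] < 100]
print(inline_copies, external_copies, cheap_inline, cheap_external)
```

Both layouts answer the query identically; the difference is where the attribute values live and how many times they are stored.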
Attributes normally fit in RAM, so the usual practice is to store them externally. This
makes filtering, sorting, and grouping very fast, because accessing data is a matter of
 