Using Sphinx with MySQL - High Performance MySQL

By default, Sphinx ranks matches using phrase proximity first and the classic BM25

weight second. This means that verbatim query quotes are guaranteed to be at the very

top, quotes that are off by a single word will be right below those, and so on.

When and how does phrase proximity affect results? Consider searching 1,000,000

pages of text for the phrase “To be or not to be.” Sphinx will put the pages with verbatim

quotes at the very top of the search results, whereas BM25-based systems will first

return the pages with the most mentions of “to,” “be,” “or,” and “not”—pages with

an exact quote match but only a few instances of “to” will be buried deep in the results.
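
The difference between the two orderings can be seen in a small sketch. It is a simplified stand-in, not Sphinx's actual ranker: phrase proximity is approximated as the longest run of consecutive query words found in the document, and the "most mentions" behavior is approximated by a plain keyword count. The function names and the two sample pages are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_proximity(query, document):
    """Longest run of consecutive query words that also appears,
    in the same order and without gaps, in the document."""
    q, d = tokenize(query), tokenize(document)
    best = 0
    prev = [0] * (len(d) + 1)
    for i in range(1, len(q) + 1):
        cur = [0] * (len(d) + 1)
        for j in range(1, len(d) + 1):
            if q[i - 1] == d[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def keyword_count(query, document):
    """Total number of occurrences of any query word in the document."""
    terms = set(tokenize(query))
    return sum(1 for word in tokenize(document) if word in terms)

query = "To be or not to be"
verbatim = "Hamlet asks: to be or not to be, that is the question."
bag = "To do or not to do, to go or not to go: be sure to decide, or not."

for name, page in (("verbatim", verbatim), ("bag", bag)):
    print(name, phrase_proximity(query, page), keyword_count(query, page))
```

Sorting by the first number puts the verbatim page on top; sorting by the second number alone puts the "bag of words" page on top, even though it never contains the phrase.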

Most major web search engines today rank results with keyword positions as well.

Searching for a phrase on Google will likely result in pages with perfect or near-perfect

phrase matches appearing at the very top of the search results, followed by the “bag of

words” documents.

However, analyzing keyword positions requires additional CPU time, and sometimes

you might need to skip it for performance reasons. There are also cases when phrase

ranking produces undesired, unexpected results. For example, searching for tags in a tag

cloud works better without keyword positions: it makes no difference whether the tags from

the query are next to each other in the document.

To allow for flexibility, Sphinx offers a choice of ranking modes. Besides the default

mode of proximity plus BM25, you can choose from a number of others that include

BM25-only weighting, fully disabled weighting (which provides a nice optimization if

you're not sorting by rank), and more.
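
As a rough illustration of what choosing a ranking mode means, the sketch below ranks three tiny pages under three modes. The mode labels ("proximity_bm25", "bm25", "none") are hypothetical names chosen for this example, not Sphinx identifiers, and the scoring is a textbook BM25 formula plus a longest-matching-run proximity measure, which only approximates what Sphinx computes internally.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Textbook BM25 score of every document for the query terms."""
    toks = [tokenize(d) for d in docs]
    n = len(toks)
    avg_len = (sum(len(t) for t in toks) / n) if n else 0.0
    avg_len = avg_len or 1.0  # avoid division by zero on empty input
    terms = set(tokenize(query))
    df = {t: sum(1 for doc in toks if t in doc) for t in terms}
    scores = []
    for doc in toks:
        tf = Counter(doc)
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for t in terms:
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(score)
    return scores

def proximity(query, doc):
    """Longest run of consecutive query words found in the document."""
    q, d = tokenize(query), tokenize(doc)
    best, prev = 0, [0] * (len(d) + 1)
    for i in range(1, len(q) + 1):
        cur = [0] * (len(d) + 1)
        for j in range(1, len(d) + 1):
            if q[i - 1] == d[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def rank(query, docs, mode="proximity_bm25"):
    """Return document indexes, best match first, for the given mode."""
    ids = list(range(len(docs)))
    if mode == "none":          # weighting disabled: keep document order
        return ids
    bm25 = bm25_scores(query, docs)
    if mode == "bm25":          # term statistics only, positions ignored
        return sorted(ids, key=lambda i: -bm25[i])
    if mode == "proximity_bm25":  # proximity first, BM25 as tie-breaker
        return sorted(ids, key=lambda i: (-proximity(query, docs[i]), -bm25[i]))
    raise ValueError("unknown ranking mode: " + mode)

docs = [
    "an unrelated page about databases",
    "the question is whether to be or not to be",
    "be ready to go or not to go, be ready to stay or not to stay, "
    "be ready to do or not to do",
]
for mode in ("proximity_bm25", "bm25", "none"):
    print(mode, rank("to be or not to be", docs, mode))
```

Note that the "none" branch never computes a score at all, which is where the optimization mentioned above comes from when results are not sorted by rank.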

Support for Attributes

Each document might contain an unlimited number of numeric attributes. Attributes

are user-specified and can contain any additional information required for a specific

task. Examples include a blog post's author ID, an inventory item's price, a category

ID, and so on.

Attributes enable efficient full-text searches with additional filtering, sorting, and

grouping of the search results. In theory, they could be stored in MySQL and pulled

from there every time a search is performed. But in practice, if a full-text search locates

even hundreds or thousands of rows (which is not many), retrieving them from MySQL

is unacceptably slow.
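
The sketch below shows, under simplifying assumptions, what filtering, sorting, and grouping by attributes look like once the full-text search has produced a set of matching document IDs. The attribute names (author_id, category_id, price), the sample values, and the helper functions are all hypothetical; the point is only that every operation is a lookup in memory, with no round trip to MySQL.

```python
from collections import Counter

# Numeric attributes kept in memory, keyed by document ID (sample data).
attributes = {
    1: {"author_id": 10, "category_id": 3, "price": 25},
    2: {"author_id": 11, "category_id": 3, "price": 15},
    3: {"author_id": 10, "category_id": 7, "price": 40},
    4: {"author_id": 12, "category_id": 3, "price": 30},
}

def filter_matches(matches, attrs, **required):
    """Keep only the matches whose attributes equal the required values."""
    return [
        doc_id for doc_id in matches
        if all(attrs[doc_id][name] == value for name, value in required.items())
    ]

def sort_matches(matches, attrs, key, descending=False):
    """Order matches by one attribute instead of by relevance."""
    return sorted(matches, key=lambda doc_id: attrs[doc_id][key], reverse=descending)

def group_matches(matches, attrs, key):
    """Count matches per distinct value of one attribute."""
    return Counter(attrs[doc_id][key] for doc_id in matches)

# Pretend the full-text search matched these document IDs.
matches = [1, 2, 3, 4]

in_category = filter_matches(matches, attributes, category_id=3)
print(in_category)
print(sort_matches(in_category, attributes, "price"))
print(group_matches(matches, attributes, "author_id"))
```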

Sphinx supports two ways to store attributes: inline in the document lists or externally

in a separate file. Inlining requires all attribute values to be stored in the index many

times, once for each time a document ID is stored. This inflates the index size and

increases I/O, but reduces use of RAM. Storing the attributes externally requires

preloading them into RAM upon searchd startup.
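
A back-of-the-envelope estimate makes the trade-off concrete. The sketch assumes 4 bytes per attribute value and an average number of document-list entries per document; both figures, and the function name, are assumptions for illustration rather than details of Sphinx's on-disk format.

```python
def attribute_storage_bytes(num_docs, avg_postings_per_doc, num_attrs,
                            bytes_per_attr=4):
    """Estimate attribute storage for the two layouts.

    Inline: the attribute values are repeated once per document-list
    entry, so they add to the index size on disk.
    External: the values are stored once per document and held in RAM.
    """
    per_doc = num_attrs * bytes_per_attr
    return {
        "inline_index_bytes": num_docs * avg_postings_per_doc * per_doc,
        "external_ram_bytes": num_docs * per_doc,
    }

# Assumed workload: 1,000,000 documents, 200 document-list entries per
# document on average, 3 numeric attributes each.
estimate = attribute_storage_bytes(1_000_000, 200, 3)
print(estimate)
```

Under these assumed numbers, inlining adds roughly 2.4 GB to the index, while external storage needs about 12 MB of RAM.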

Attributes normally fit in RAM, so the usual practice is to store them externally. This

makes filtering, sorting, and grouping very fast, because accessing data is a matter of
