Here are the database size and load numbers:
• There are about 150-200 million records, which become about 50-100 GB of data after preprocessing.
• The load is approximately 60,000-100,000 GROUP BY queries per day.
The indexes for the distributed GROUP BY were deployed on the same search cluster of
6 machines and 24 logical CPUs described previously. This is a minor complementary
load to the main search load over the 1.5 TB text database.
Sphinx replaced MySQL's exact, slow, single-CPU computations with approximate,
fast, distributed computations. All of the factors that introduce approximation errors
are present here: the incoming data frequently contains too many rows to fit in the “sort
buffer” (we use a fixed RAM limit of 100K rows), we use COUNT(DISTINCT), and the result sets are aggregated over the network. Despite that, the results for the first 10 to 1000 groups, which are the ones actually required for the reports, are 99% to 100% correct.
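As a rough illustration of the query pattern described above, the following sketch uses the classic Sphinx PHP API (sphinxapi.php). The index name stats, the grouping attribute domain_id, the distinct-counted attribute visitor_id, and the host and port are assumptions made for the example, not the actual schema; the third argument to SetLimits() sets the per-query max_matches, which corresponds to the 100K-row "sort buffer" limit mentioned above.

<?php
// Sketch: distributed GROUP BY with an approximate COUNT(DISTINCT),
// using the classic Sphinx PHP API. Names below are illustrative assumptions.
require_once('sphinxapi.php');

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);              // searchd host and port (assumed)
$cl->SetMatchMode(SPH_MATCH_FULLSCAN);          // no full-text query, attribute scan only
$cl->SetGroupBy('domain_id', SPH_GROUPBY_ATTR, '@count DESC');  // GROUP BY domain_id
$cl->SetGroupDistinct('visitor_id');            // approximate COUNT(DISTINCT visitor_id)
$cl->SetLimits(0, 1000, 100000);                // first 1,000 groups; 100K-row sort buffer

$res = $cl->Query('', 'stats');                 // 'stats' can be a plain or distributed index
if ($res === false)
    die('Sphinx error: ' . $cl->GetLastError());

foreach ($res['matches'] as $docid => $match)
{
    // Sphinx adds @groupby, @count, and @distinct as virtual attributes
    printf("group=%s count=%d distinct=%d\n",
        $match['attrs']['@groupby'],
        $match['attrs']['@count'],
        $match['attrs']['@distinct']);
}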
The indexed data is very different from the data that would be used for an ordinary
full-text search. There are a huge number of documents and keywords, even though
the documents are very small. The document numbering is nonsequential, because a
special numbering convention (source server, source table, and primary key) that does
not fit in 32 bits is used. The huge number of search “keywords” also caused frequent CRC32 collisions (Sphinx uses CRC32 to map keywords to internal word IDs).
For these reasons, we were forced to use 64-bit identifiers everywhere internally.
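For illustration only, here is one possible way such a composite 64-bit document ID could be packed in PHP. The bit layout (8 bits for the source server, 8 for the source table, 48 for the primary key) is an assumption, not the convention actually used, and it requires a 64-bit PHP build plus Sphinx compiled with 64-bit ID support.

<?php
// Hypothetical packing of (source server, source table, primary key) into a
// single 64-bit Sphinx document ID. Bit widths are assumptions for the example.
function make_docid($server_id, $table_id, $pk)
{
    return ($server_id << 56) | ($table_id << 48) | $pk;
}

function split_docid($docid)
{
    return array(
        'server' => ($docid >> 56) & 0xFF,
        'table'  => ($docid >> 48) & 0xFF,
        'pk'     => $docid & 0xFFFFFFFFFFFF,
    );
}

// Example: row 123456789 from table 3 on source server 12
print_r(split_docid(make_docid(12, 3, 123456789)));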
The current performance is satisfactory. For the most complex domains, queries normally complete in 0.1 to 1.0 seconds.
The following are the lessons learned from this example:
• For GROUP BY queries, some precision can be traded for speed.
• With huge textual collections or moderately sized special collections, 64-bit
identifiers might be required.
Optimizing Sharded JOIN Queries on Grouply.com
Grouply (http://www.grouply.com) built a Sphinx-based solution to search its multi-million-record database of tagged messages, using Sphinx's MVA support. The database is split across many physical servers for massive scalability, so it might be necessary to query tables that are located on different servers. Arbitrary large-scale joins are impossible because there are too many participating servers, databases, and tables.
Grouply uses Sphinx's MVA attributes to store message tags. The tag list is retrieved from a Sphinx cluster via the PHP API. This replaces multiple sequential SELECTs from several MySQL servers. To reduce the number of SQL queries as well, certain
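A minimal sketch of that tag-lookup pattern through the classic Sphinx PHP API follows. The index name messages, the MVA attribute tag_ids, the tag value, and the host are assumptions for the example, not Grouply's actual schema.

<?php
// Sketch: fetching messages and their tag lists in one Sphinx query instead of
// multiple per-message SELECTs. Names below are illustrative assumptions.
require_once('sphinxapi.php');

$cl = new SphinxClient();
$cl->SetServer('sphinx.example.com', 9312);
$cl->SetMatchMode(SPH_MATCH_EXTENDED2);
$cl->SetFilter('tag_ids', array(42));   // MVA filter: only messages carrying tag 42
$cl->SetLimits(0, 20);

$res = $cl->Query('budget meeting', 'messages');
if ($res === false)
    die('Sphinx error: ' . $cl->GetLastError());

foreach ($res['matches'] as $docid => $match)
{
    // The MVA attribute comes back as a plain PHP array of tag IDs, so the
    // tags can be displayed without any follow-up SELECT against MySQL.
    printf("message %d: tags [%s]\n", $docid, implode(',', $match['attrs']['tag_ids']));
}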
 