Here are the database size and load numbers:
• There are about 150-200 million records, which become about 50-100 GB of data after preprocessing.
• The load is approximately 60,000-100,000 GROUP BY queries per day.
The indexes for the distributed GROUP BY were deployed on the same search cluster of
6 machines and 24 logical CPUs described previously. This is a minor complementary
load to the main search load over the 1.5 TB text database.
Sphinx replaced MySQL's exact, slow, single-CPU computations with approximate,
fast, distributed computations. All of the factors that introduce approximation errors
are present here: the incoming data frequently contains too many rows to fit in the “sort
buffer” (we use a fixed RAM limit of 100K rows), we use COUNT(DISTINCT), and the result sets are aggregated over the network. Despite that, the results for the first 10 to 1000 groups, which are the ones actually required for the reports, are 99% to 100% correct.
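As a rough illustration of the query pattern described above, the following sketch uses the classic Sphinx PHP API (sphinxapi.php). The index name stats, the grouping attribute domain_id, the distinct-counted attribute visitor_id, and the host and port are assumptions made for the example, not the actual schema; the third argument to SetLimits() sets the per-query max_matches, which corresponds to the 100K-row "sort buffer" limit mentioned above.

<?php
// Sketch: distributed GROUP BY with an approximate COUNT(DISTINCT),
// using the classic Sphinx PHP API. Names below are illustrative assumptions.
require_once('sphinxapi.php');

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);              // searchd host and port (assumed)
$cl->SetMatchMode(SPH_MATCH_FULLSCAN);          // no full-text query, attribute scan only
$cl->SetGroupBy('domain_id', SPH_GROUPBY_ATTR, '@count DESC');  // GROUP BY domain_id
$cl->SetGroupDistinct('visitor_id');            // approximate COUNT(DISTINCT visitor_id)
$cl->SetLimits(0, 1000, 100000);                // first 1,000 groups; 100K-row sort buffer

$res = $cl->Query('', 'stats');                 // 'stats' can be a plain or distributed index
if ($res === false)
    die('Sphinx error: ' . $cl->GetLastError());

foreach ($res['matches'] as $docid => $match)
{
    // Sphinx adds @groupby, @count, and @distinct as virtual attributes
    printf("group=%s count=%d distinct=%d\n",
        $match['attrs']['@groupby'],
        $match['attrs']['@count'],
        $match['attrs']['@distinct']);
}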
The indexed data is very different from the data that would be used for an ordinary
full-text search. There are a huge number of documents and keywords, even though
the documents are very small. The document numbering is nonsequential, because a
special numbering convention (source server, source table, and primary key) that does
not fit in 32 bits is used. The huge number of search “keywords” also caused frequent CRC32 collisions (Sphinx uses CRC32 to map keywords to internal word IDs).
For these reasons, we were forced to use 64-bit identifiers everywhere internally.
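For illustration only, here is one possible way such a composite 64-bit document ID could be packed in PHP. The bit layout (8 bits for the source server, 8 for the source table, 48 for the primary key) is an assumption, not the convention actually used, and it requires a 64-bit PHP build plus Sphinx compiled with 64-bit ID support.

<?php
// Hypothetical packing of (source server, source table, primary key) into a
// single 64-bit Sphinx document ID. Bit widths are assumptions for the example.
function make_docid($server_id, $table_id, $pk)
{
    return ($server_id << 56) | ($table_id << 48) | $pk;
}

function split_docid($docid)
{
    return array(
        'server' => ($docid >> 56) & 0xFF,
        'table'  => ($docid >> 48) & 0xFF,
        'pk'     => $docid & 0xFFFFFFFFFFFF,
    );
}

// Example: row 123456789 from table 3 on source server 12
print_r(split_docid(make_docid(12, 3, 123456789)));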
The current performance is satisfactory. For the most complex domains, queries normally complete in 0.1 to 1.0 seconds.
The following are the lessons learned from this example:
• For GROUP BY queries, some precision can be traded for speed.
• With huge textual collections or moderately sized special collections, 64-bit
identifiers might be required.
Optimizing Sharded JOIN Queries on Grouply.com
Grouply (http://www.grouply.com) built a Sphinx-based solution to search its multi-million-record database of tagged messages, using Sphinx's MVA support. The database is split across many physical servers for massive scalability, so it might be necessary to query tables that are located on different servers. Arbitrary large-scale joins are impossible because there are too many participating servers, databases, and tables.
Grouply uses Sphinx's MVA attributes to store message tags. The tag list is retrieved from a Sphinx cluster via the PHP API. This replaces multiple sequential SELECTs from several MySQL servers. To reduce the number of SQL queries as well, certain
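A minimal sketch of that tag-lookup pattern through the classic Sphinx PHP API follows. The index name messages, the MVA attribute tag_ids, the tag value, and the host are assumptions for the example, not Grouply's actual schema.

<?php
// Sketch: fetching messages and their tag lists in one Sphinx query instead of
// multiple per-message SELECTs. Names below are illustrative assumptions.
require_once('sphinxapi.php');

$cl = new SphinxClient();
$cl->SetServer('sphinx.example.com', 9312);
$cl->SetMatchMode(SPH_MATCH_EXTENDED2);
$cl->SetFilter('tag_ids', array(42));   // MVA filter: only messages carrying tag 42
$cl->SetLimits(0, 20);

$res = $cl->Query('budget meeting', 'messages');
if ($res === false)
    die('Sphinx error: ' . $cl->GetLastError());

foreach ($res['matches'] as $docid => $match)
{
    // The MVA attribute comes back as a plain PHP array of tag IDs, so the
    // tags can be displayed without any follow-up SELECT against MySQL.
    printf("message %d: tags [%s]\n", $docid, implode(',', $match['attrs']['tag_ids']));
}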
 