Databases Reference
In-Depth Information
8.5.1 Matching Bids and Search Queries
As we have described the adwords problem, and as it normally appears in
practice, advertisers bid on sets of words. If a search query occurs having
exactly that set of words in some order, then the bid is said to match the
query, and it becomes a candidate for selection. We can avoid having to deal
with word order by storing all sets of words representing a bid in lexicographic
(alphabetic) order. The list of words in sorted order forms the hash-key for the
bid, and these bids may be stored in a hash table used as an index, as discussed
in Section 1.3.2.
Search queries also have their words sorted prior to lookup. When we hash
the sorted list, we find in the hash table all the bids for exactly that set of words.
They can be retrieved quickly, since we have only to look at the contents of that
bucket.
Moreover, there is a good chance that we can keep the entire hash table in
main memory. If there are a million advertisers, each bidding on 100 queries,
and the record of the bid requires 100 bytes, then we require ten gigabytes of
main memory, which is well within the limits of what is feasible for a single
machine. If more space is required, we can split the buckets of the hash table
among as many machines as we need. Search queries can be hashed and sent
to the appropriate machine.
In practice, search queries may be arriving far too rapidly for a single ma-
chine, or group of machines that collaborate on a single query at a time, to
handle them all. In that case, the stream of queries is split into as many pieces
as necessary, and each piece is handled by one group of machines. In fact, an-
swering the search query, independent of ads, will require a group of machines
working in parallel anyway, in order that the entire processing of a query can
be done in main memory.
8.5.2 More Complex Matching Problems
However, the potential for matching bids to objects is not limited to the case
where the objects are search queries and the match criterion is same-set-of-
words. For example, Google also matches adwords bids to emails. There, the
match criterion is not based on the equality of sets. Rather, a bid on a set of
words S matches an email if all the words in S appear anywhere in the email.
This matching problem is much harder. We can still maintain a hash-table
index for the bids, but the number of subsets of words in a hundred-word email
is much too large to look up all the sets, or even all the small sets of (say)
three or fewer words. There are a number of other potential applications of
this sort of matching that, at the time of this writing, are not implemented
but could be. They all involve standing queries - queries that users post to a
site, expecting the site to notify them whenever something matching the query
becomes available at the site. For example:
1. Twitter allows one to follow all the “tweets” of a given person. However,
Search WWH ::




Custom Search