Database Reference
In-Depth Information
a site, expecting the site to notify them whenever something matching the query becomes
available at the site. For example:
(1) Twitter allows one to follow all the “tweets” of a given person. However, it is feasible
to allow users to specify a set of words, such as
ipod free music
and see all the tweets where all these words appear, not necessarily in order, and not
necessarily adjacent.
(2) On-line news sites often allow users to select from among certain keywords or phrases,
e.g., “healthcare” or “Barack Obama,” and receive alerts whenever a new news article
contains that word or consecutive sequence of words. This problem is simpler than the
email/adwords problem for several reasons. Matching single words or consecutive se-
quences of words, even in a long article, is not as time-consuming as matching small
sets of words. Further, the sets of terms that one can search for is limited, so there
aren't too many “bids.” Even if many people want alerts about the same term, only one
index entry, with the list of all those people associated, is required. However, a more
advanced system could allow users to specify alerts for sets of words in a news article,
just as the Adwords system allows anyone to bid on a set of words in an email.
8.5.3
A Matching Algorithm for Documents and Bids
We shall offer an algorithm that will match many “bids” against many “documents.” As
before, a bid is a (typically small) set of words. A document is a larger set of words, such
as an email, tweet, or news article. We assume there may be hundreds of documents per
second arriving, although if there are that many, the document stream may be split among
many machines or groups of machines. We assume there are many bids, perhaps on the or-
der of a hundred million or a billion. As always, we want to do as much in main memory
as we can.
We shall, as before, represent a bid by its words listed in some order. There are two new
elements in the representation. First, we shall include a status with each list of words. The
status is an integer indicating how many of the first words on the list have been matched by
the current document. When a bid is stored in the index, its status is always 0.
Second, while the order of words could be lexicographic, we can lower the amount of
work by ordering words rarest-first. However, since the number of different words that can
appear in emails is essentially unlimited, it is not feasible to order all words in this manner.
As a compromise, we might identify the n most common words on the Web or in a sample
of the stream of documents we are processing. Here, n might be a hundred thousand or a
Search WWH ::




Custom Search