Databases Reference
In-Depth Information
it is feasible to allow users to specify a set of words, such as
ipod free music
and see all the tweets where all these words appear, not necessarily in
order, and not necessarily adjacent.
2. On-line news sites often allow users to select from among certain key-
words or phrases, e.g., “healthcare” or “Barack Obama,” and receive
alerts whenever a new news article contains that word or consecutive
sequence of words. This problem is simpler than the email/adwords prob-
lem for several reasons. Matching single words or consecutive sequences
of words, even in a long article, is not as time-consuming as matching
small sets of words. Further, the sets of terms that one can search for
is limited, so there aren't too many “bids.” Even if many people want
alerts about the same term, only one index entry, with the list of all those
people associated, is required. However, a more advanced system could
allow users to specify alerts for sets of words in a news article, just as the
Adwords system allows anyone to bid on a set of words in an email.
8.5.3 A Matching Algorithm for Documents and Bids
We shall offer an algorithm that will match many “bids” against many “docu-
ments.” As before, a bid is a (typically small) set of words. A document is a
larger set of words, such as an email, tweet, or news article. We assume there
may be hundreds of documents per second arriving, although if there are that
many, the document stream may be split among many machines or groups of
machines. We assume there are many bids, perhaps on the order of a hundred
million or a billion. As always, we want to do as much in main memory as we
can.
We shall, as before, represent a bid by its words listed in some order. There
are two new elements in the representation. First, we shall include a status with
each list of words. The status is an integer indicating how many of the first
words on the list have been matched by the current document. When a bid is
stored in the index, its status is always 0.
Second, while the order of words could be lexicographic, we can lower the
amount of work by ordering words rarest-first. However, since the number
of different words that can appear in emails is essentially unlimited, it is not
feasible to order all words in this manner. As a compromise, we might identify
the n most common words on the Web or in a sample of the stream of documents
we are processing. Here, n might be a hundred thousand or a million. These
n words are sorted by frequency, and they occupy the end of the list, with the
most frequent words at the very end. All words not among the n most frequent
can be assumed equally infrequent and ordered lexicographically. Then, the
words of any document can be ordered. If a word does not appear on the list
of n frequent words, place it at the front of the order, lexicographically. Those
Search WWH ::




Custom Search