Advertising on the Web - Mining of Massive Datasets

Databases Reference

In-Depth Information

it is feasible to allow users to specify a set of words, such as

ipod free music

and see all the tweets where all these words appear, not necessarily in

order, and not necessarily adjacent.

2. On-line news sites often allow users to select from among certain key-

words or phrases, e.g., “healthcare” or “Barack Obama,” and receive

alerts whenever a new news article contains that word or consecutive

sequence of words. This problem is simpler than the email/adwords prob-

lem for several reasons. Matching single words or consecutive sequences

of words, even in a long article, is not as time-consuming as matching

small sets of words. Further, the sets of terms that one can search for

is limited, so there aren't too many “bids.” Even if many people want

alerts about the same term, only one index entry, with the list of all those

people associated, is required. However, a more advanced system could

allow users to specify alerts for sets of words in a news article, just as the

Adwords system allows anyone to bid on a set of words in an email.

8.5.3 A Matching Algorithm for Documents and Bids

We shall offer an algorithm that will match many “bids” against many “docu-

ments.” As before, a bid is a (typically small) set of words. A document is a

larger set of words, such as an email, tweet, or news article. We assume there

may be hundreds of documents per second arriving, although if there are that

many, the document stream may be split among many machines or groups of

machines. We assume there are many bids, perhaps on the order of a hundred

million or a billion. As always, we want to do as much in main memory as we

can.

We shall, as before, represent a bid by its words listed in some order. There

are two new elements in the representation. First, we shall include a status with

each list of words. The status is an integer indicating how many of the first

words on the list have been matched by the current document. When a bid is

stored in the index, its status is always 0.

Second, while the order of words could be lexicographic, we can lower the

amount of work by ordering words rarest-first. However, since the number

of different words that can appear in emails is essentially unlimited, it is not

feasible to order all words in this manner. As a compromise, we might identify

the n most common words on the Web or in a sample of the stream of documents

we are processing. Here, n might be a hundred thousand or a million. These

n words are sorted by frequency, and they occupy the end of the list, with the

most frequent words at the very end. All words not among the n most frequent

can be assumed equally infrequent and ordered lexicographically. Then, the

words of any document can be ordered. If a word does not appear on the list

of n frequent words, place it at the front of the order, lexicographically. Those

Search WWH ::

Custom Search

Home