that present problems for the simple version of PageRank.
5.1.1 Early Search Engines and Term Spam
There were many search engines before Google. Largely, they worked by crawling the Web and listing the terms (words or other strings of non-whitespace characters) found in each page, in an inverted index. An inverted index is a data structure that makes it easy, given a term, to find (pointers to) all the places where that term occurs.
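The idea can be sketched in a few lines. This is a toy illustration, not a production index: the pages and their contents are invented for the example, and terms are obtained by naive whitespace splitting.

```python
from collections import defaultdict

# Hypothetical toy collection of pages (invented for illustration).
pages = {
    "p1": "free movie downloads",
    "p2": "movie reviews and ratings",
    "p3": "buy shirts online",
}

# Build the inverted index: each term maps to the set of pages containing it.
index = defaultdict(set)
for page_id, text in pages.items():
    for term in text.split():
        index[term].add(page_id)

# Given a term, the index yields pointers to every place it occurs.
sorted(index["movie"])  # the pages containing "movie": p1 and p2
```

A real search engine would also record the positions of each occurrence within the page, so that phrase queries and context (such as whether the term appears in a header) can be handled.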
When a search query (list of terms) was issued, the pages with those terms
were extracted from the inverted index and ranked in a way that reflected the
use of the terms within the page. Thus, the presence of a term in a header of the page made the page more relevant than the presence of the same term in ordinary text, and a large number of occurrences of the term added to the assumed relevance of the page for the search query.
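A minimal sketch of this style of ranking follows. The weights are illustrative assumptions, not values any particular engine used; the point is only that header occurrences count for more than body occurrences, and that more occurrences raise the score.

```python
# Illustrative weights (assumptions, not from any real engine).
HEADER_WEIGHT = 5
BODY_WEIGHT = 1

def term_score(query_terms, header_terms, body_terms):
    """Score a page for a query: header hits outweigh body hits,
    and repeated occurrences accumulate."""
    score = 0
    for term in query_terms:
        score += HEADER_WEIGHT * header_terms.count(term)
        score += BODY_WEIGHT * body_terms.count(term)
    return score

# A page with "movie" once in its header and twice in its body
# scores 5 + 2 = 7 for the query "movie".
term_score(["movie"], ["movie", "reviews"], ["movie", "movie", "ratings"])
```

Exactly this accumulation of occurrence counts is what term spammers exploited, as described next.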
As people began to use search engines to find their way around the Web,
unethical people saw the opportunity to fool search engines into leading people
to their page. Thus, if you were selling shirts on the Web, all you cared about
was that people would see your page, regardless of what they were looking for.
For instance, you could add a term like “movie” to your page thousands of times, so a search engine would think you were a terribly important page about
movies. When a user issued a search query with the term “movie,” the search
engine would list your page first. To prevent the thousands of occurrences of
“movie” from appearing on your page, you could give it the same color as the
background. And if simply adding “movie” to your page didn't do the trick,
then you could go to the search engine, give it the query “movie,” and see what
page did come back as the first choice. Then, copy that page into your own,
again using the background color to make it invisible.
Techniques for fooling search engines into believing your page is about something it is not are called term spam. The ability of term spammers to operate
so easily rendered early search engines almost useless. To combat term spam,
Google introduced two innovations:
1. PageRank was used to simulate where Web surfers, starting at a random
page, would tend to congregate if they followed randomly chosen outlinks
from the page at which they were currently located, and this process were
allowed to iterate many times. Pages that would have a large number of
surfers were considered more “important” than pages that would rarely
be visited. Google prefers important pages to unimportant pages when
deciding which pages to show first in response to a search query.
2. The content of a page was judged not only by the terms appearing on that
page, but by the terms used in or near the links to that page. Note that
while it is easy for a spammer to add false terms to a page they control,
they cannot as easily get false terms added to the pages that link to their
own page, if they do not control those pages.
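The random-surfer idea in item 1 can be sketched as a direct simulation. The link graph, page names, and surfer counts below are assumptions made for illustration; in practice PageRank is computed by iterating a matrix-vector equation rather than by tracking individual surfers, and it adds a "taxation" step to handle pages with no out-links, which this toy graph avoids.

```python
import random

# Hypothetical toy link graph: page -> list of pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def simulate_surfers(links, num_surfers=10_000, steps=50, seed=0):
    """Start each surfer at a random page, follow randomly chosen
    out-links for many steps, and report where surfers congregate."""
    rng = random.Random(seed)
    counts = {page: 0 for page in links}
    for _ in range(num_surfers):
        page = rng.choice(sorted(links))        # start at a random page
        for _ in range(steps):
            page = rng.choice(links[page])      # follow a random out-link
        counts[page] += 1
    # The fraction of surfers ending at a page estimates its importance.
    return {page: n / num_surfers for page, n in counts.items()}

estimates = simulate_surfers(links)
```

On this graph, page B is the least important: it is reachable only through half of A's out-links, whereas A and C each receive all the traffic leaving some page.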