5.1.1 Early Search Engines and Term Spam
There were many search engines before Google. Largely, they worked by crawling the Web
and listing the terms (words or other strings of characters other than white space) found in
each page, in an inverted index. An inverted index is a data structure that makes it easy,
given a term, to find (pointers to) all the places where that term occurs.
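As a minimal sketch in Python (the page data and names here are invented for illustration, not taken from any particular engine), an inverted index can map each term to pointers, here (page, position) pairs, for every occurrence of that term:

    from collections import defaultdict

    def build_inverted_index(pages):
        # Map each term to (page_id, position) pairs: pointers to
        # every place the term occurs.
        index = defaultdict(list)
        for page_id, text in pages.items():
            for position, term in enumerate(text.split()):
                index[term].append((page_id, position))
        return index

    pages = {"p1": "movie reviews and more movie news",
             "p2": "shirts for sale"}
    index = build_inverted_index(pages)
    print(index["movie"])   # [('p1', 0), ('p1', 4)]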
When a search query (a list of terms) was issued, the pages containing those terms were extracted from the inverted index and ranked in a way that reflected how the terms were used within each page. Thus, the presence of a term in a header of the page made the page more relevant than the presence of the term in ordinary text would, and a large number of occurrences of the term added to the page's assumed relevance for the search query.
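A hedged sketch of that style of ranking (the weights are invented for illustration; real engines tuned their own): count occurrences of the query terms, weighting a term in the header more heavily than one in the body.

    HEADER_WEIGHT = 5   # assumed weights, chosen only for illustration
    BODY_WEIGHT = 1

    def relevance(page, query_terms):
        # Weighted count of query-term occurrences in the page.
        score = 0
        for term in query_terms:
            score += HEADER_WEIGHT * page["header"].split().count(term)
            score += BODY_WEIGHT * page["body"].split().count(term)
        return score

    pages = [
        {"url": "a.example", "header": "movie reviews",
         "body": "reviews of every new movie"},
        {"url": "b.example", "header": "news",
         "body": "movie mentioned once"},
    ]
    ranked = sorted(pages, key=lambda p: relevance(p, ["movie"]),
                    reverse=True)   # a.example ranks first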
As people began to use search engines to find their way around the Web, unethical people saw the opportunity to fool search engines into leading people to their pages. If you were selling shirts on the Web, all you cared about was that people saw your page, regardless of what they were looking for. Thus, you could add a term like "movie" to your page thousands of times, so a search engine would conclude that yours was a terribly important page about movies. When a user issued a search query with the term "movie," the search engine would list your page first. To keep the thousands of occurrences of "movie" from being visible to people viewing the page, you could give the text the same color as the background. And if simply adding "movie" to your page didn't do the trick, you could go to the search engine, issue the query "movie," and see what page came back as the first choice. Then you could copy that page into your own, again using the background color to make it invisible.
Techniques for fooling search engines into believing your page is about something it is not are called term spam. The ability of term spammers to operate so easily rendered early search engines almost useless. To combat term spam, Google introduced two innovations:
(1) PageRank was used to simulate where Web surfers, starting at a random page, would tend to congregate if they followed randomly chosen outlinks from the page at which they were currently located, and this process were allowed to iterate many times (a sketch of the iteration appears after this list). Pages that would have a large number of surfers were considered more "important" than pages that would rarely be visited. Google prefers important pages to unimportant pages when deciding which pages to show first in response to a search query.
(2) The content of a page was judged not only by the terms appearing on that page, but also by the terms used in or near the links to that page. Note that while it is easy for spammers to add false terms to a page they control, they cannot as easily get false terms added to the pages that link to their own page, if they do not control those pages.
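A minimal sketch of the random-surfer iteration in item (1) follows (the graph and iteration count are illustrative, and complications such as pages with no outlinks are ignored here):

    def pagerank(links, iterations=50):
        # links maps each page to its list of outlinks; every page is
        # assumed to have at least one outlink.
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}  # surfers start at random pages
        for _ in range(iterations):
            new_rank = {p: 0.0 for p in pages}
            for p in pages:
                share = rank[p] / len(links[p])      # surfers split evenly over outlinks
                for q in links[p]:
                    new_rank[q] += share
            rank = new_rank
        return rank

    links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
    print(pagerank(links))   # A collects the most surfers (about 4/9)

Item (2) can be sketched similarly (again an illustration, not Google's actual indexing): credit a page with the terms in the anchor text of links pointing at it, terms that a spammer who does not control the linking pages cannot easily forge.

    from collections import defaultdict

    def index_with_anchor_text(pages, links):
        # pages: page_id -> text; links: (source, target, anchor_text) triples.
        index = defaultdict(set)
        for page_id, text in pages.items():
            for term in text.split():
                index[term].add(page_id)
        for _source, target, anchor in links:
            for term in anchor.split():
                index[term].add(target)   # credit the linked-to page
        return index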