5.1.1 Early Search Engines and Term Spam
There were many search engines before Google. Largely, they worked by crawling the Web
and listing the terms (words or other strings of characters other than white space) found in
each page, in an inverted index. An inverted index is a data structure that makes it easy,
given a term, to find (pointers to) all the places where that term occurs.
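As a minimal sketch in Python (the page data and names here are invented for illustration, not taken from any particular engine), an inverted index can map each term to pointers, here (page, position) pairs, for every occurrence of that term:

    from collections import defaultdict

    def build_inverted_index(pages):
        # Map each term to (page_id, position) pairs: pointers to
        # every place the term occurs.
        index = defaultdict(list)
        for page_id, text in pages.items():
            for position, term in enumerate(text.split()):
                index[term].append((page_id, position))
        return index

    pages = {"p1": "movie reviews and more movie news",
             "p2": "shirts for sale"}
    index = build_inverted_index(pages)
    print(index["movie"])   # [('p1', 0), ('p1', 4)]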
When a search query (a list of terms) was issued, the pages containing those terms were extracted from the inverted index and ranked in a way that reflected how the terms were used within each page. Thus, the presence of a term in a header of the page made the page more relevant than the presence of the term in ordinary text would, and a large number of occurrences of the term added to the page's assumed relevance for the search query.
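A hedged sketch of that style of ranking (the weights are invented for illustration; real engines tuned their own): count occurrences of the query terms, weighting a term in the header more heavily than one in the body.

    HEADER_WEIGHT = 5   # assumed weights, chosen only for illustration
    BODY_WEIGHT = 1

    def relevance(page, query_terms):
        # Weighted count of query-term occurrences in the page.
        score = 0
        for term in query_terms:
            score += HEADER_WEIGHT * page["header"].split().count(term)
            score += BODY_WEIGHT * page["body"].split().count(term)
        return score

    pages = [
        {"url": "a.example", "header": "movie reviews",
         "body": "reviews of every new movie"},
        {"url": "b.example", "header": "news",
         "body": "movie mentioned once"},
    ]
    ranked = sorted(pages, key=lambda p: relevance(p, ["movie"]),
                    reverse=True)   # a.example ranks first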
As people began to use search engines to find their way around the Web, unethical people saw the opportunity to fool search engines into leading people to their pages. If you were selling shirts on the Web, all you cared about was that people saw your page, regardless of what they were looking for. Thus, you could add a term like "movie" to your page thousands of times, so a search engine would conclude that yours was a terribly important page about movies. When a user issued a search query with the term "movie," the search engine would list your page first. To keep the thousands of occurrences of "movie" from being visible to people viewing the page, you could give the text the same color as the background. And if simply adding "movie" to your page didn't do the trick, you could go to the search engine, issue the query "movie," and see what page came back as the first choice. Then you could copy that page into your own, again using the background color to make it invisible.
Techniques for fooling search engines into believing your page is about something it is not are called term spam. The ability of term spammers to operate so easily rendered early search engines almost useless. To combat term spam, Google introduced two innovations:
(1) PageRank was used to simulate where Web surfers, starting at a random page, would tend to congregate if they followed randomly chosen outlinks from the page at which they were currently located, and this process were allowed to iterate many times (a sketch of the iteration appears after this list). Pages that would have a large number of surfers were considered more "important" than pages that would rarely be visited. Google prefers important pages to unimportant pages when deciding which pages to show first in response to a search query.
(2) The content of a page was judged not only by the terms appearing on that page, but also by the terms used in or near the links to that page. Note that while it is easy for spammers to add false terms to a page they control, they cannot as easily get false terms added to the pages that link to their own page, if they do not control those pages.
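A minimal sketch of the random-surfer iteration in item (1) follows (the graph and iteration count are illustrative, and complications such as pages with no outlinks are ignored here):

    def pagerank(links, iterations=50):
        # links maps each page to its list of outlinks; every page is
        # assumed to have at least one outlink.
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}  # surfers start at random pages
        for _ in range(iterations):
            new_rank = {p: 0.0 for p in pages}
            for p in pages:
                share = rank[p] / len(links[p])      # surfers split evenly over outlinks
                for q in links[p]:
                    new_rank[q] += share
            rank = new_rank
        return rank

    links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
    print(pagerank(links))   # A collects the most surfers (about 4/9)

Item (2) can be sketched similarly (again an illustration, not Google's actual indexing): credit a page with the terms in the anchor text of links pointing at it, terms that a spammer who does not control the linking pages cannot easily forge.

    from collections import defaultdict

    def index_with_anchor_text(pages, links):
        # pages: page_id -> text; links: (source, target, anchor_text) triples.
        index = defaultdict(set)
        for page_id, text in pages.items():
            for term in text.split():
                index[term].add(page_id)
        for _source, target, anchor in links:
            for term in anchor.split():
                index[term].add(target)   # credit the linked-to page
        return index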