that present problems for the simple version of PageRank.
5.1.1 Early Search Engines and Term Spam
There were many search engines before Google. Largely, they worked by crawling the Web and listing the terms (words or other strings of non-whitespace characters) found in each page, in an inverted index. An inverted index is a data structure that makes it easy, given a term, to find (pointers to) all the places where that term occurs.
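The idea can be sketched in a few lines. This is a toy illustration, not a production index: the pages and their contents are invented for the example, and terms are obtained by naive whitespace splitting.

```python
from collections import defaultdict

# Hypothetical toy collection of pages (invented for illustration).
pages = {
    "p1": "free movie downloads",
    "p2": "movie reviews and ratings",
    "p3": "buy shirts online",
}

# Build the inverted index: each term maps to the set of pages containing it.
index = defaultdict(set)
for page_id, text in pages.items():
    for term in text.split():
        index[term].add(page_id)

# Given a term, the index yields pointers to every place it occurs.
sorted(index["movie"])  # the pages containing "movie": p1 and p2
```

A real search engine would also record the positions of each occurrence within the page, so that phrase queries and context (such as whether the term appears in a header) can be handled.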
When a search query (list of terms) was issued, the pages with those terms
were extracted from the inverted index and ranked in a way that reflected the
use of the terms within the page. Thus, the presence of a term in a header of the page made the page more relevant than the presence of the same term in ordinary text, and a large number of occurrences of the term added to the assumed relevance of the page for the search query.
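A minimal sketch of this style of ranking follows. The weights are illustrative assumptions, not values any particular engine used; the point is only that header occurrences count for more than body occurrences, and that more occurrences raise the score.

```python
# Illustrative weights (assumptions, not from any real engine).
HEADER_WEIGHT = 5
BODY_WEIGHT = 1

def term_score(query_terms, header_terms, body_terms):
    """Score a page for a query: header hits outweigh body hits,
    and repeated occurrences accumulate."""
    score = 0
    for term in query_terms:
        score += HEADER_WEIGHT * header_terms.count(term)
        score += BODY_WEIGHT * body_terms.count(term)
    return score

# A page with "movie" once in its header and twice in its body
# scores 5 + 2 = 7 for the query "movie".
term_score(["movie"], ["movie", "reviews"], ["movie", "movie", "ratings"])
```

Exactly this accumulation of occurrence counts is what term spammers exploited, as described next.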
As people began to use search engines to find their way around the Web,
unethical people saw the opportunity to fool search engines into leading people
to their page. Thus, if you were selling shirts on the Web, all you cared about
was that people would see your page, regardless of what they were looking for.
For instance, you could add a term like “movie” to your page thousands of times, so a search engine would think you were a terribly important page about
movies. When a user issued a search query with the term “movie,” the search
engine would list your page first. To prevent the thousands of occurrences of
“movie” from appearing on your page, you could give it the same color as the
background. And if simply adding “movie” to your page didn't do the trick,
then you could go to the search engine, give it the query “movie,” and see what
page did come back as the first choice. Then, copy that page into your own,
again using the background color to make it invisible.
Techniques for fooling search engines into believing your page is about something it is not are called term spam. The ability of term spammers to operate
so easily rendered early search engines almost useless. To combat term spam,
Google introduced two innovations:
1. PageRank was used to simulate where Web surfers, starting at a random
page, would tend to congregate if they followed randomly chosen outlinks
from the page at which they were currently located, and this process were
allowed to iterate many times. Pages that would have a large number of
surfers were considered more “important” than pages that would rarely
be visited. Google prefers important pages to unimportant pages when
deciding which pages to show first in response to a search query.
2. The content of a page was judged not only by the terms appearing on that
page, but by the terms used in or near the links to that page. Note that
while it is easy for a spammer to add false terms to a page they control,
they cannot as easily get false terms added to the pages that link to their
own page, if they do not control those pages.
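The random-surfer idea in item 1 can be sketched as a direct simulation. The link graph, page names, and surfer counts below are assumptions made for illustration; in practice PageRank is computed by iterating a matrix-vector equation rather than by tracking individual surfers, and it adds a "taxation" step to handle pages with no out-links, which this toy graph avoids.

```python
import random

# Hypothetical toy link graph: page -> list of pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def simulate_surfers(links, num_surfers=10_000, steps=50, seed=0):
    """Start each surfer at a random page, follow randomly chosen
    out-links for many steps, and report where surfers congregate."""
    rng = random.Random(seed)
    counts = {page: 0 for page in links}
    for _ in range(num_surfers):
        page = rng.choice(sorted(links))        # start at a random page
        for _ in range(steps):
            page = rng.choice(links[page])      # follow a random out-link
        counts[page] += 1
    # The fraction of surfers ending at a page estimates its importance.
    return {page: n / num_surfers for page, n in counts.items()}

estimates = simulate_surfers(links)
```

On this graph, page B is the least important: it is reachable only through half of A's out-links, whereas A and C each receive all the traffic leaving some page.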