1. The first step is to load the URL State database with an initial set of URLs. These can be a broad set of top-level domains, such as the 1.7 million web sites with the highest US-based traffic, or the results of selective searches against another index, or manually selected URLs that point to specific, high-quality pages.
2. Once the URL State database has been loaded with some initial URLs, the first
loop in the focused crawl can begin. The first step in each loop is to extract all of
the unprocessed URLs, and sort them by their link score.
3. Next comes one of the two critical steps in the workflow. A decision is made about how many of the top-scoring URLs to process in this loop. The fewer the number, the “tighter” the focus of the crawl. There are many options for deciding how many URLs to accept: for example, a fixed minimum score, a fixed percentage of all URLs, or a maximum count. More sophisticated approaches include picking a cutoff score that represents the transition point (elbow) in a power curve (see the first sketch following this list).
4. Once the set of accepted URLs has been created, the standard fetch process begins. This includes all of the usual steps required for polite and efficient fetching, such as robots.txt processing. Pages that are successfully fetched can then be parsed.
5. Typically, fetched pages are also saved into the Fetched Pages database.
6. Now comes the second of the two critical steps. The parsed page content is
given to the page scorer, which returns a value representing how closely the page
matches the focus of the crawl. Typically this is a value from 0.0 to 1.0, with
higher scores being better.
7. Once the page has been scored, each outlink found in the parse is extracted.
8. The score for the page is divided among all of its outlinks (see the second sketch following this list).
9. Finally, the URL State database is updated with the results of the fetch attempts (succeeded or failed), all newly discovered URLs are added, and existing URLs have their link scores increased by the scores of any matching outlinks extracted during this loop.
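To make the cutoff decision in step 3 concrete, here is a minimal Java sketch of the strategies mentioned above. The class and method names (UrlCutoff, ScoredUrl, and so on) are illustrative only; they are not part of Nutch or of the crawler described here, and the elbow detection shown is just one simple way to find the transition point in a score curve.

import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch of step 3: deciding how many of the top-scoring
 * unprocessed URLs to accept in the current loop. The input list is
 * assumed to be sorted by link score, highest first.
 */
public class UrlCutoff {

    /** Keep every URL whose link score meets a fixed minimum. */
    public static List<ScoredUrl> byMinScore(List<ScoredUrl> sorted, double minScore) {
        List<ScoredUrl> accepted = new ArrayList<>();
        for (ScoredUrl u : sorted) {
            if (u.linkScore >= minScore) {
                accepted.add(u);
            }
        }
        return accepted;
    }

    /** Keep a fixed fraction (e.g. the top 10%) of all unprocessed URLs. */
    public static List<ScoredUrl> byPercentage(List<ScoredUrl> sorted, double fraction) {
        int count = (int) Math.ceil(sorted.size() * fraction);
        return new ArrayList<>(sorted.subList(0, Math.min(count, sorted.size())));
    }

    /** Keep at most a fixed number of URLs. */
    public static List<ScoredUrl> byMaxCount(List<ScoredUrl> sorted, int maxCount) {
        return new ArrayList<>(sorted.subList(0, Math.min(maxCount, sorted.size())));
    }

    /**
     * Keep URLs above the "elbow" of the score curve: the point with the
     * greatest gap below the straight line joining the highest and lowest
     * scores. A simple stand-in for more sophisticated knee detection.
     */
    public static List<ScoredUrl> byElbow(List<ScoredUrl> sorted) {
        int n = sorted.size();
        if (n < 3) {
            return new ArrayList<>(sorted);
        }
        double first = sorted.get(0).linkScore;
        double last = sorted.get(n - 1).linkScore;
        int elbow = 0;
        double maxGap = -1.0;
        for (int i = 0; i < n; i++) {
            // Score the straight line would have at position i.
            double lineValue = first + (last - first) * i / (n - 1);
            double gap = lineValue - sorted.get(i).linkScore;
            if (gap > maxGap) {
                maxGap = gap;
                elbow = i;
            }
        }
        return new ArrayList<>(sorted.subList(0, elbow + 1));
    }

    /** Minimal URL-plus-score record; a real URL State entry holds more. */
    public static class ScoredUrl {
        final String url;
        final double linkScore;

        ScoredUrl(String url, double linkScore) {
            this.url = url;
            this.linkScore = linkScore;
        }
    }
}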
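Steps 6 through 9 can be sketched in the same spirit: the page scorer produces a value (typically 0.0 to 1.0), that value is divided among the page's outlinks, and the shares are folded back into the per-URL link scores. The class below uses an in-memory map as a stand-in for the URL State database; the names are hypothetical and the real crawl also records fetch status alongside the scores.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch of steps 6-9: spreading a page's score across its
 * outlinks and accumulating the shares into per-URL link scores.
 */
public class OutlinkScoring {

    // url -> accumulated link score (kept in the URL State database
    // in the real crawl, together with fetch status).
    private final Map<String, Double> linkScores = new HashMap<>();

    /**
     * @param pageScore value returned by the page scorer, typically 0.0 to 1.0
     * @param outlinks  URLs extracted from the parsed page
     */
    public void distribute(double pageScore, List<String> outlinks) {
        if (outlinks.isEmpty()) {
            return;
        }
        // The page's score is divided among all of its outlinks.
        double share = pageScore / outlinks.size();
        for (String outlink : outlinks) {
            // Newly discovered URLs start at the share; existing URLs have
            // their link score increased by it.
            linkScores.merge(outlink, share, Double::sum);
        }
    }

    public double linkScoreOf(String url) {
        return linkScores.getOrDefault(url, 0.0);
    }
}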
At this point the focused crawl can terminate, if enough pages of sufficiently high quality (score) have been found, or the next loop can begin.
The crawl thus proceeds in a depth-first manner, focusing on the areas of the web graph where the most high-scoring pages are found.
In the end we wound up with about 50 million pages, and a “crawlDB” that contained around 250 million URLs, about half of which were scored high enough that we would eventually want to crawl them.
6.3.2 Web Page Processing
Once we had fetched a web page (or a document such as a PDF), we would parse it to extract the title and text. Again, we leveraged the parsing support that was already there in Nutch.
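Nutch's parsing support is built on Apache Tika, so a small standalone Tika sketch gives the flavor of this step. This is not the Nutch plugin code itself; the file-based main method and the class name PageParser are just for illustration.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

/** Extracts the title and plain text from a fetched page or document. */
public class PageParser {

    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables the default write limit so large pages are not truncated.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();

        // args[0] is a path to a fetched page or document (HTML, PDF, ...).
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata);
        }

        String title = metadata.get(TikaCoreProperties.TITLE);
        String text = handler.toString();
        System.out.println("Title: " + title);
        System.out.println("Text length: " + text.length());
    }
}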