this.cancel = false;
this.startTime = new Date();
The spider then loops until there are no more hosts to process. If at least one URL has
been added, there will be at least one host to process. You should always add at least one URL
to the spider; otherwise, it has no work to do.
do {
    processHost();
} while (this.workloadManager.nextHost() != null);
Finally, the spider shuts down the thread pool and records the stop time.
this.threadPool.shutdown();
this.stopTime = new Date();
At this point the spider is complete, and the process method returns.
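The shape of this host loop can be tried in isolation. The following is a minimal sketch, not the spider's actual code: HostQueue and HostLoopSketch are hypothetical stand-ins for the workload manager and the process method, and the body of the loop only records the host name instead of downloading anything.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Date;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the workload manager: it tracks a current host
// and advances to the next one on request.
class HostQueue {
    private final Deque<String> pending = new ArrayDeque<>();
    private String current;

    HostQueue(List<String> hosts) {
        pending.addAll(hosts);
        current = pending.poll(); // null when no host was ever added
    }

    String getCurrentHost() {
        return current;
    }

    // Moves to the next host and returns it, or null when none remain.
    String nextHost() {
        current = pending.poll();
        return current;
    }
}

class HostLoopSketch {
    // Visits each host once, using the same do/while shape as the process method.
    static List<String> run(HostQueue queue) {
        List<String> visited = new ArrayList<>();
        do {
            // Stand-in for processHost(): record the host being handled.
            visited.add(queue.getCurrentHost());
        } while (queue.nextHost() != null);
        return visited;
    }

    public static void main(String[] args) {
        Date startTime = new Date();
        List<String> visited =
                run(new HostQueue(List.of("a.example", "b.example")));
        Date stopTime = new Date();
        System.out.println(visited + " visited in "
                + (stopTime.getTime() - startTime.getTime()) + " ms");
    }
}
```

Because a do/while loop tests its condition after the body, the body in this sketch runs once even when the queue starts empty, in which case the current host is null. That is one way to see why at least one URL should be added before processing begins.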
Processing One Host
The processHost method is called once for each host the spider needs to process. It
processes the URLs in the workload that belong to the current host. The process method
calls processHost in a loop until all hosts have been processed.
The processHost method begins by obtaining the current host.
URL url = null;
String host = this.workloadManager.getCurrentHost();
Next, the spider's report object is notified that a new host is beginning. If beginHost
returns false, processHost returns without processing the host.
if (!this.report.beginHost(host)) {
    return;
}
Next, all filters are notified that we are moving to a new host.
for (SpiderFilter filter : this.filters) {
    try {
        filter.newHost(host, this.options.userAgent);
    } catch (IOException e) {
        logger.log(Level.INFO,
            "Error while reading robots.txt file: "
            + e.getMessage());
    }
}
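The two notification steps above can be sketched on their own. This is an illustration under stated assumptions rather than the spider's real classes: HostReport, HostFilter, and HostStartSketch are hypothetical, cut-down stand-ins, and errors are collected in a list instead of being sent to a logger.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, cut-down version of the report callback.
interface HostReport {
    boolean beginHost(String host); // false means "skip this host"
}

// Hypothetical, cut-down version of a spider filter.
interface HostFilter {
    void newHost(String host, String userAgent) throws IOException;
}

class HostStartSketch {
    // Returns true if processing of the host should continue.
    static boolean startHost(String host, String userAgent, HostReport report,
                             List<HostFilter> filters, List<String> errors) {
        // The report may veto the host before any filter is contacted.
        if (!report.beginHost(host)) {
            return false;
        }
        for (HostFilter filter : filters) {
            try {
                filter.newHost(host, userAgent);
            } catch (IOException e) {
                // A failing filter is recorded, and the remaining filters
                // are still notified.
                errors.add("Error while reading robots.txt file: "
                        + e.getMessage());
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        List<HostFilter> filters = new ArrayList<>();
        filters.add((h, ua) -> { throw new IOException("connection timed out"); });
        filters.add((h, ua) -> System.out.println("filter ready for " + h));
        boolean proceed = startHost("www.example.com", "SketchAgent/1.0",
                h -> true, filters, errors);
        System.out.println("proceed=" + proceed + ", errors=" + errors);
    }
}
```

Catching the IOException inside the loop, rather than around it, means that one filter failing to read its data does not prevent the other filters from learning about the new host.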