Java Reference
In-Depth Information
Table 13.1: Spider Configuration Options

Configuration Option   Purpose
timeout                How long to wait for a connection/read (in milliseconds).
maxDepth               How deep to search for links (1 for homepage only, -1 for infinite depth).
userAgent              What user agent to report, blank to report the Java user agent.
corePoolSize           Core thread pool size, the minimum number of threads to use.
maximumPoolSize        Maximum thread pool size.
keepAliveTime          How long to keep an idle thread alive (seconds).
dbURL                  The JDBC URL of a database to use.
dbClass                The full class name of a JDBC driver to use.
workloadManager        The full class name of a workload manager to use.
startup                What to do on startup. Specify “clear” to clear the workload or “resume” to resume processing.
filter                 The full class path of a filter class. More than one filter can be specified.

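As a rough illustration of how options like these might be read into a program, the sketch below parses "name: value" lines with java.util.Properties. The class name SpiderOptionsSketch, the helper names, the line format, and the sample values are all assumptions made for this example; the spider's actual configuration format and loader are not shown in this section.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of the spider described in the text.
class SpiderOptionsSketch {

    // Parses "name: value" (or "name=value") lines into a Properties object.
    static Properties parse(String text) {
        Properties options = new Properties();
        try {
            options.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return options;
    }

    // Returns an integer option, or the fallback if it is missing or blank.
    static int intOption(Properties options, String name, int fallback) {
        String value = options.getProperty(name);
        if (value == null || value.trim().isEmpty()) {
            return fallback;
        }
        return Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        // Sample values are illustrative only.
        String text = String.join("\n",
                "timeout: 60000",
                "maxDepth: -1",
                "userAgent:",
                "corePoolSize: 100",
                "maximumPoolSize: 100",
                "keepAliveTime: 60",
                "startup: clear");
        Properties options = parse(text);
        System.out.println("timeout  = " + intOption(options, "timeout", 30000));
        System.out.println("maxDepth = " + intOption(options, "maxDepth", 1));
    }
}
```

Note that Properties keeps only one value per key, so an option that may appear more than once, such as filter, would need a different container in a real loader.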
The timeout value allows you to define the amount of time, in milliseconds, that you will
wait for a page to load.
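To make the timeout concrete, here is a minimal sketch of how a millisecond timeout can be applied to a standard java.net.URLConnection. It assumes the same value is used for both the connect and the read timeout, as the table's "connection/read" wording suggests; the class and method names are hypothetical and this is not the spider's own code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper, not part of the spider described in the text.
class TimeoutSketch {

    // Creates a connection object (without connecting yet) and applies
    // the same timeout, in milliseconds, to connecting and reading.
    static URLConnection openWithTimeout(String address, int timeoutMillis) {
        try {
            URLConnection connection = URI.create(address).toURL().openConnection();
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = openWithTimeout("http://www.example.com/", 60000);
        System.out.println("connect timeout: " + connection.getConnectTimeout());
        System.out.println("read timeout:    " + connection.getReadTimeout());
    }
}
```

With these settings, a connect or read that exceeds the limit raises java.net.SocketTimeoutException instead of blocking indefinitely.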
The userAgent property allows you to specify the User-Agent header that the
spider will report when accessing web sites. It is usually a good idea to create a specific user
agent for your spider so that it can be identified. If you set this value to null, then the
default Java user agent will be used.
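The behavior described above can be sketched with the standard URLConnection API: the header is set only when a user agent is supplied, so a null or blank value leaves Java's default in place. The class name, method name, and the "ExampleSpider/1.0" string are assumptions for illustration, not names taken from the spider itself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper, not part of the spider described in the text.
class UserAgentSketch {

    // Creates a connection object and sets the User-Agent header only if
    // one was supplied; otherwise Java's default user agent is sent.
    static URLConnection open(String address, String userAgent) {
        try {
            URLConnection connection = URI.create(address).toURL().openConnection();
            if (userAgent != null && !userAgent.isEmpty()) {
                connection.setRequestProperty("User-Agent", userAgent);
            }
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = open("http://www.example.com/", "ExampleSpider/1.0");
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```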
The corePoolSize, maximumPoolSize, and keepAliveTime properties
allow you to define how the thread pool works. The thread pool is a pool of threads that
waits to handle tasks assigned to it by the spider. Using a thread pool makes a spider much
more efficient, because a spider spends much of its time waiting for web servers to respond.
If a spider can wait on multiple pages at the same time, overall performance is greatly
increased. The maximumPoolSize property states the maximum number of threads in
the pool. If a thread has no work to do after the number of seconds specified by the
keepAliveTime property, then the thread will be killed. Idle threads are no longer killed
once the thread pool has shrunk to the size specified by the corePoolSize property.
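These three properties correspond closely to the constructor arguments of java.util.concurrent.ThreadPoolExecutor, so a minimal sketch of such a pool might look like the following. Whether the spider builds its pool exactly this way is an assumption; the class name, the SynchronousQueue hand-off, and the CallerRunsPolicy are choices made for this example only.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the spider described in the text.
class ThreadPoolSketch {

    // Builds a pool that keeps at least corePoolSize threads, grows up to
    // maximumPoolSize, and retires extra threads idle for keepAliveSeconds.
    static ThreadPoolExecutor createPool(int corePoolSize, int maximumPoolSize,
                                         long keepAliveSeconds) {
        return new ThreadPoolExecutor(
                corePoolSize,
                maximumPoolSize,
                keepAliveSeconds,
                TimeUnit.SECONDS,
                // Assumption: hand tasks straight to threads so the pool
                // actually grows beyond the core size under load.
                new SynchronousQueue<>(),
                // Assumption: when all threads are busy, run the task in
                // the submitting thread instead of rejecting it.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = createPool(2, 4, 60);
        for (int i = 0; i < 8; i++) {
            final int page = i;
            // Stand-in for downloading a page.
            pool.execute(() -> System.out.println(
                    Thread.currentThread().getName() + " handles page " + page));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

The queue choice matters: with an unbounded queue, ThreadPoolExecutor never creates more than corePoolSize threads, so maximumPoolSize would have no effect.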
The dbURL and dbClass properties allow you to define a JDBC database. For more
information on how to set these values, refer to the next chapter.