The workloadManager property specifies which type of workload manager
handles the URL list. To specify an in-memory workload manager, use the
following option:
com.heatonresearch.httprecipes.spider.workload.memory.MemoryWorkloadManager
To specify an SQL-based workload manager, use the following option:
com.heatonresearch.httprecipes.spider.workload.sql.SQLWorkloadManager
If you use an SQL workload manager, you must also specify the dbURL and
dbClass properties, and you must create a database that contains the required
tables. This is covered in the next chapter. A memory workload needs no
additional options.
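Because the workloadManager value is a fully qualified class name, a spider could plausibly turn it into an object through reflection. The following is a minimal sketch of that idea, not the library's actual loading code. The names DemoWorkloadManager, DemoMemoryWorkloadManager, and WorkloadLoader are hypothetical stand-ins so that the example runs without the spider classes on the classpath.

```java
// Hypothetical stand-in for the spider's workload manager type (assumption).
interface DemoWorkloadManager {
    String describe();
}

// Hypothetical in-memory implementation used only for this illustration.
class DemoMemoryWorkloadManager implements DemoWorkloadManager {
    public String describe() {
        return "in-memory workload";
    }
}

class WorkloadLoader {
    // Creates a workload manager from a class name, as a property value would supply it.
    static DemoWorkloadManager load(String className) {
        try {
            Class<?> type = Class.forName(className);
            Object instance = type.getDeclaredConstructor().newInstance();
            return (DemoWorkloadManager) instance;
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException(
                "Cannot load workload manager: " + className, e);
        }
    }

    public static void main(String[] args) {
        // In a real configuration the name would come from the options object.
        String name = DemoMemoryWorkloadManager.class.getName();
        System.out.println(load(name).describe());
    }
}
```

A misspelled class name fails at startup with a clear exception, which is the main practical consequence of configuring the workload manager by name.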
A memory workload is simpler to set up than an SQL workload. However, a memory
workload can hold only a limited number of sites, and the memory workload
manager can spider only a single host. To spider multiple hosts, or very large
hosts, you must use an SQL workload manager.
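The rule of thumb above can be sketched as a small helper that picks one of the two class names. This is only an illustration: the WorkloadChoice class, its choose method, and the veryLargeSite flag are assumptions made for this example, and what counts as a "very large" host is left to the caller.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class WorkloadChoice {
    static final String MEMORY =
        "com.heatonresearch.httprecipes.spider.workload.memory.MemoryWorkloadManager";
    static final String SQL =
        "com.heatonresearch.httprecipes.spider.workload.sql.SQLWorkloadManager";

    // Returns the SQL manager for multiple hosts or a very large site,
    // and the memory manager otherwise.
    static String choose(List<String> startUrls, boolean veryLargeSite) {
        Set<String> hosts = new HashSet<>();
        for (String url : startUrls) {
            String host = URI.create(url).getHost();
            if (host != null) {
                hosts.add(host.toLowerCase(Locale.ROOT));
            }
        }
        return (hosts.size() > 1 || veryLargeSite) ? SQL : MEMORY;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://www.example.com/", "http://www.example.org/");
        System.out.println(choose(urls, false));
    }
}
```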
There are two values you can specify for the startup property. If you specify
clear, the entire workload is erased and the spider starts over. If you specify
resume, the spider resumes where the last run left off.
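The difference between the two startup values can be illustrated with a queue of pending URLs. This is a simplified sketch, not the spider's implementation: the StartupDemo class and its prepare method are hypothetical, and re-seeding an empty workload on resume is an assumption made for the example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

class StartupDemo {
    // Builds the working queue from a previously saved workload.
    static Deque<String> prepare(Deque<String> saved, String startup, String startUrl) {
        Deque<String> work = new ArrayDeque<>(saved); // copy, so the saved state is untouched
        if ("clear".equalsIgnoreCase(startup)) {
            // Erase everything and start over from the starting URL.
            work.clear();
            work.add(startUrl);
        } else if ("resume".equalsIgnoreCase(startup)) {
            // Keep the pending URLs; seed the queue only if nothing was saved (assumption).
            if (work.isEmpty()) {
                work.add(startUrl);
            }
        } else {
            throw new IllegalArgumentException("Unknown startup value: " + startup);
        }
        return work;
    }

    public static void main(String[] args) {
        Deque<String> saved = new ArrayDeque<>(Arrays.asList(
            "http://www.example.com/b", "http://www.example.com/c"));
        System.out.println(prepare(saved, "clear", "http://www.example.com/"));
        System.out.println(prepare(saved, "resume", "http://www.example.com/"));
    }
}
```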
The filter property allows you to specify one or more filters to use. You
should always use at least the RobotsFilter. This filter ensures that your bot
complies with the “Bot Exclusion Standard”, which is covered in
Chapter 16, “Well Behaved Bots”.
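To give a rough idea of what a robots filter has to decide, the sketch below reads the Disallow rules that apply to all user agents and tests a path against them. It is not the RobotsFilter class from the library: SimpleRobotsCheck and isExcluded are hypothetical names, and the parsing is deliberately simplified (it ignores Allow rules, wildcards, and agent-specific groups).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SimpleRobotsCheck {
    private final List<String> disallowed = new ArrayList<>();

    // Collects the Disallow prefixes from groups addressed to "User-agent: *".
    SimpleRobotsCheck(String robotsTxt) {
        boolean applies = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#'); // strip trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean match = value.equals("*");
                // Consecutive User-agent lines belong to the same group.
                applies = lastWasAgent ? (applies || match) : match;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (applies && field.equals("disallow") && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
    }

    // Returns true if the path starts with any collected Disallow prefix.
    boolean isExcluded(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SimpleRobotsCheck check = new SimpleRobotsCheck(
            "User-agent: *\nDisallow: /private/\n");
        System.out.println(check.isExcluded("/private/report.html"));
        System.out.println(check.isExcluded("/index.html"));
    }
}
```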
The following code shows how you might initialize a SpiderOptions object:
SpiderOptions spiderOptions = new SpiderOptions();
spiderOptions.timeout(60000);
spiderOptions.maxDepth(-1);
spiderOptions.userAgent(null);
spiderOptions.corePoolSize(100);
spiderOptions.maximumPoolSize(100);
spiderOptions.keepAliveTime(60);
spiderOptions.dbURL(
    "jdbc:mysql://127.0.0.1/spider?user="+
    "testuser&password=testpassword");
spiderOptions.dbClass("com.mysql.jdbc.Driver");
spiderOptions.workloadManager(
    "com.heatonresearch.httprecipes.spider."+