Finally, readAll is called to read the entire HTML file. The HTML file is written
to the attached OutputStream as it is parsed.
parse.readAll();
Now that the file has been written, the OutputStream can be closed.
os.close();
Recipe #13.3: Spider the World
Perhaps the best known of all spiders are the search engine spiders. These are the
spiders that sites such as Google use to add new sites to their search engines. Such spiders
are not designed to stay on one specific site. In this recipe I will show you how to create a
spider that does not restrict itself to one site; instead, it keeps following links endlessly.
Note that this spider is very unlikely ever to finish, since it would have to visit nearly
every public URL on the Internet to do so.
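The difference between a single-site spider and a world spider comes down to whether links that leave the starting host are queued. The following is only a minimal sketch of that idea, not the recipe's code. It walks a small in-memory link graph instead of the real web, and the class name CrawlScopeSketch, the crawl and sameHost methods, and the example.org URL are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CrawlScopeSketch {

    /** True when both URIs name the same host (case-insensitive). */
    static boolean sameHost(URI a, URI b) {
        return a.getHost() != null && a.getHost().equalsIgnoreCase(b.getHost());
    }

    /**
     * Breadth-first walk over a hypothetical in-memory link graph.
     * When restrictToHost is false, every discovered link is queued,
     * which is the "world spider" behavior described in the text.
     */
    static Set<URI> crawl(URI start, Map<URI, List<URI>> links, boolean restrictToHost) {
        Set<URI> visited = new LinkedHashSet<>();
        Deque<URI> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            URI current = queue.remove();
            if (!visited.add(current)) {
                continue; // already processed this URL
            }
            for (URI next : links.getOrDefault(current, List.of())) {
                if (!restrictToHost || sameHost(start, next)) {
                    queue.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        URI home = URI.create("http://www.example.com/");
        URI about = URI.create("http://www.example.com/about");
        URI other = URI.create("http://www.example.org/");
        Map<URI, List<URI>> links = Map.of(
                home, List.of(about, other),
                about, List.of(home));

        System.out.println(crawl(home, links, true).size());  // prints 2
        System.out.println(crawl(home, links, false).size()); // prints 3
    }
}
```

On this toy graph the unrestricted walk terminates because the graph is finite. On the real web the queue keeps growing, which is why the world spider is not expected to finish.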
To start this spider, you must provide three arguments. The first argument is the name
of the spider configuration file. The configuration file selects either an SQL-based or a
memory-based workload manager; Listing 13.1 shows an example spider configuration file.
The second argument is the local directory to download the sites to. The third argument
is the starting URL.
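The three arguments described above could be checked along the following lines before any spidering begins. This is a hedged sketch rather than the recipe's own main method: the class WorldSpiderArgsSketch, the nested Args holder, and the usage message are hypothetical names introduced here for illustration.

```java
import java.io.File;
import java.net.URI;

public class WorldSpiderArgsSketch {

    /** Hypothetical holder for the three command-line arguments. */
    public static final class Args {
        public final File config;      // spider configuration file
        public final File downloadDir; // local directory to download into
        public final URI startUrl;     // URL where spidering begins

        Args(File config, File downloadDir, URI startUrl) {
            this.config = config;
            this.downloadDir = downloadDir;
            this.startUrl = startUrl;
        }
    }

    /** Validates the argument count and that the starting URL is absolute. */
    public static Args parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                    "Usage: WorldSpider [config file] [local path] [starting url]");
        }
        URI start = URI.create(args[2]); // throws IllegalArgumentException on bad syntax
        if (!start.isAbsolute() || start.getHost() == null) {
            throw new IllegalArgumentException("Starting URL must be absolute: " + args[2]);
        }
        return new Args(new File(args[0]), new File(args[1]), start);
    }

    public static void main(String[] args) {
        try {
            Args parsed = parse(args);
            System.out.println("Config:    " + parsed.config);
            System.out.println("Directory: " + parsed.downloadDir);
            System.out.println("Start URL: " + parsed.startUrl);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

The sketch only validates the shape of the arguments. It does not check that the configuration file exists or that the directory is writable.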
The following shows how you might start the spider.
WorldSpider spider.conf c:\temp\ http://www.example.com
The above command simply shows the abstract format to call this recipe, with the
appropriate parameters. For exact information on how to run this recipe, refer to Appendix B,
C, or D, depending on the operating system you are using. This spider is designed to access
a large number of sites, so you should use the SQLWorkloadManager class with it. The
MemoryWorkloadManager is designed to work with only a single host, so it is not
compatible with this spider.
Now that you have seen how to use the world spider, we will examine how it was
constructed.
Creating the World Spider
The WorldSpider class contains the main method for the recipe. Listing 13.8
shows the WorldSpider class.
Listing 13.8: Download the World (WorldSpider.java)
package com.heatonresearch.httprecipes.ch13.recipe3;
import java.io.*;
import java.net.*;