USING A SPIDER - HTTP Programming Recipes for Java Bots

This method is called to download images and other binary objects; anything that is not HTML is downloaded by it. HTML is handled differently because HTML contains links to other pages. The method begins by creating a buffer for reading the binary data.

byte[] buffer = new byte[1024];
int length;

Next, a filename is created. The convertFilename function converts the URL into a path that can be saved on the local computer, and it also creates the directory structure needed to hold that file.

String filename = URLUtility.convertFilename(this.path, url, true);
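To make the URL-to-file mapping concrete, here is a minimal sketch of how such a conversion could work. It is not the book's URLUtility.convertFilename; the class name FilenameSketch, the method toLocalPath, the index.html default, and the set of characters treated as unsafe are all assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;

// Hypothetical helper for illustration; NOT the book's URLUtility class.
final class FilenameSketch {
    private FilenameSketch() {}

    /**
     * Maps a URL to a file path under basePath. When mkdir is true, the
     * parent directories of the resulting file are created as well.
     */
    static String toLocalPath(String basePath, URL url, boolean mkdir)
            throws IOException {
        String path = url.getPath();
        // Assumed rule: a URL that names a directory is saved as index.html.
        if (path.isEmpty() || path.endsWith("/")) {
            path = path + "index.html";
        }
        File target = new File(basePath);
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before the leading slash
            }
            // Assumed set of safe characters; everything else becomes '_'.
            String safe = segment.replaceAll("[^A-Za-z0-9._-]", "_");
            // Neutralize "." and ".." so the path cannot leave basePath.
            if (safe.matches("\\.+")) {
                safe = "_";
            }
            target = new File(target, safe);
        }
        File parent = target.getParentFile();
        if (mkdir && parent != null && !parent.isDirectory() && !parent.mkdirs()) {
            throw new IOException("Could not create directory " + parent);
        }
        return target.getPath();
    }
}
```

Under these assumptions, a URL whose path is /images/logo.png maps to images/logo.png beneath the base path, and the images directory is created on demand.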

Next, the data is read in. It is read using the buffer variable that was created earlier.

try {
    OutputStream os = new FileOutputStream(filename);
    do {
        length = stream.read(buffer);
        if (length != -1) {
            os.write(buffer, 0, length);
        }
    } while (length != -1);

Once the data has been read, the output stream can be closed.

os.close();

If the output file cannot be opened, the resulting FileNotFoundException is caught and its stack trace is displayed to the user.

} catch (FileNotFoundException e) {
    e.printStackTrace();
}
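The fragments above can be assembled into a single routine. The following is a sketch rather than the book's listing: the class name BinarySaveSketch and the method save are hypothetical, and it swaps the explicit os.close() for try-with-resources, so the file is closed even when a read or write fails partway through.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical wrapper around the copy loop described in the text.
final class BinarySaveSketch {
    private BinarySaveSketch() {}

    /** Copies every byte of stream into the named file; returns the byte count. */
    static long save(InputStream stream, String filename) throws IOException {
        byte[] buffer = new byte[1024];
        long total = 0;
        // try-with-resources closes the file even if read() or write() throws.
        try (OutputStream os = new FileOutputStream(filename)) {
            int length;
            // read() returns -1 once the end of the stream is reached.
            while ((length = stream.read(buffer)) != -1) {
                os.write(buffer, 0, length);
                total += length;
            }
        }
        return total;
    }
}
```

Note that read() may return fewer bytes than the buffer holds, which is why the write uses length rather than buffer.length.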

This recipe also has to handle HTML data. If a URL returns HTML, the second form of the spiderProcessURL method is used.

public void spiderProcessURL(URL url, SpiderParseHTML parse)

First, a filename is generated, just as was done for the binary URL. An OutputStream is then opened to write the file.

String filename = URLUtility.convertFilename(this.path, url, true);
OutputStream os = new FileOutputStream(filename);

The OutputStream is then attached to the stream held by the ParseHTML object, so that any data read from the HTML stream is also written to the OutputStream. This saves the HTML file to the local computer.

parse.getStream().setOutputStream(os);
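The idea of copying data to a file as a side effect of reading it can be sketched with a small stream wrapper. This is an illustration of the technique only, not the stream class the book uses; the name CopyingInputStream and its constructor are assumptions.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical class: echoes every byte that is read to a second stream.
final class CopyingInputStream extends FilterInputStream {
    private final OutputStream copy;

    CopyingInputStream(InputStream in, OutputStream copy) {
        super(in);
        this.copy = copy;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            copy.write(b); // echo the single byte just read
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            copy.write(buf, off, n); // echo only the bytes actually read
        }
        return n;
    }

    @Override
    public boolean markSupported() {
        return false; // rewinding would write the same bytes twice
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            copy.close(); // closing the reader also finishes the saved file
        }
    }
}
```

With a wrapper like this, a parser that simply reads the page to the end leaves a complete copy of the HTML in the output stream, without a separate download pass.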
