BEYOND SIMPLE REQUESTS - HTTP Programming Recipes for Java Bots

The downloadPage function is the one we created in Chapter 3; however, you will notice an additional parameter. The second parameter specifies a 1,000-millisecond timeout. If a connection is not made within one second (1,000 milliseconds), the connection attempt is aborted and an exception is thrown.
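As a rough illustration of how such a timeout parameter might be wired up, the sketch below uses the standard java.net.URLConnection API. The class name TimeoutDownload, the UTF-8 decoding, and the method body are assumptions made for this example; the book's own downloadPage may be implemented differently.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the book's downloadPage; details are assumed.
class TimeoutDownload {
    static String downloadPage(URL url, int timeout) throws IOException {
        URLConnection connection = url.openConnection();
        // Give up if the connection is not established within `timeout` ms.
        // A failure surfaces as java.net.SocketTimeoutException, which is
        // a subclass of IOException.
        connection.setConnectTimeout(timeout);

        StringBuilder result = new StringBuilder();
        char[] buffer = new char[8192];
        // Assumption: the page is decoded as UTF-8.
        try (Reader reader = new InputStreamReader(
                connection.getInputStream(), StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                result.append(buffer, 0, n);
            }
        }
        return result.toString();
    }
}
```

Note that setConnectTimeout only limits the time spent establishing the connection. A slow response body could still block, so a bot that needs a hard upper bound would probably also call setReadTimeout.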

Next, the extractNoCase function is called to extract the text between the <title> and </title> tags. The extractNoCase function is a special version of the extract function introduced in Chapter 3 that ignores the case of the tags. For example, <title> and <Title> are considered the same. If no title is found, the site is listed as "[Untitled site]".

    String title = extractNoCase(page, "<title>", "</title>", 0);
    if (title == null)
      title = "[Untitled site]";
    return title;

If an exception occurs, then a value of null is returned, indicating that a site could not be found.

  } catch (IOException e)
  {
    return null;
  }

This recipe makes use of the extractNoCase function, which is a new version of the extract function. Both functions can be seen in Listing 4.2. They are both slight modifications of the functions introduced in Chapter 3. For more information on these functions, see Chapter 3.
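To make the case-insensitive matching concrete, here is a minimal sketch of how a function like extractNoCase could work. It is not the code from Listing 4.2. The class name ExtractSketch and the helper indexOfIgnoreCase are hypothetical, and the sketch assumes the fourth argument is a zero-based index of the occurrence to extract, which may not match the book's convention.

```java
// Illustrative sketch only; the book's Listing 4.2 may differ.
class ExtractSketch {
    // Hypothetical helper: find `token` in `text` at or after `from`,
    // ignoring case. Returns -1 if the token is not found.
    static int indexOfIgnoreCase(String text, String token, int from) {
        for (int i = Math.max(from, 0); i + token.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, token, 0, token.length())) {
                return i;
            }
        }
        return -1;
    }

    // Return the text between token1 and token2, ignoring the case of
    // both tokens. Assumption: `index` is zero-based, so 0 selects the
    // first occurrence of token1. Returns null if nothing matches.
    static String extractNoCase(String str, String token1, String token2, int index) {
        int start = -1;
        for (int i = 0; i <= index; i++) {
            start = indexOfIgnoreCase(str, token1, start + 1);
            if (start == -1) {
                return null;
            }
        }
        int contentStart = start + token1.length();
        int end = indexOfIgnoreCase(str, token2, contentStart);
        if (end == -1) {
            return null;
        }
        return str.substring(contentStart, end);
    }
}
```

Under these assumptions, a page containing <TITLE>Hello</Title> yields the string Hello, and a page with no title tag yields null, which the recipe then replaces with "[Untitled site]". Using regionMatches avoids lowercasing the whole page, which can shift character positions for a few Unicode characters.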

Recipe #4.3: Download Binary or Text

Downloading a file from a URL is a common task for a bot. However, different procedures must be followed depending on the type of file being downloaded. If the file is binary, such as an image, then an exact copy of the file must be made on the local computer. If the file is text, then the line breaks must be properly formatted for the current operating system.

Chapter 3 introduced two recipes for downloading files from a URL. One version would download a text file; the other would download a binary file. As you saw earlier in this chapter, the content-type header tells what type of file will be downloaded. Recipe 4.3 contains a more sophisticated URL downloader than the one in Chapter 3. It first determines the type of file and then downloads it in the appropriate way. Listing 4.3 shows this new URL downloader.
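Before reading the full listing, it may help to see the core decision in isolation. The sketch below is not taken from Listing 4.3. The class name DownloadSketch and its method names are hypothetical, and it assumes that any content type beginning with "text/" is text and that text is encoded as UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustrative sketch only; names and the UTF-8 choice are assumptions.
class DownloadSketch {
    // Assumption: a content type such as "text/html; charset=utf-8" is text.
    static boolean isText(String contentType) {
        return contentType != null
                && contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/");
    }

    // Binary: copy every byte unchanged.
    static void copyBinary(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    // Text: rewrite each line with the local line separator.
    // Note that every line, including the last, ends with a separator.
    static void copyText(InputStream in, OutputStream out) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        String eol = System.lineSeparator();
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.write(eol);
        }
        writer.flush();
    }

    // Pick the copy strategy from the content-type header value.
    static void copy(String contentType, InputStream in, OutputStream out)
            throws IOException {
        if (isText(contentType)) {
            copyText(in, out);
        } else {
            copyBinary(in, out);
        }
    }
}
```

In a real bot, the contentType argument would most likely come from URLConnection.getContentType(), and the streams from the connection and a local FileOutputStream.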

Listing 4.3: Download Text or Binary (DownloadURL.java)

package com.heatonresearch.httprecipes.ch4.recipe3;

import java.net.*;

import java.io.*;
