The downloadPage function is the one we created in Chapter 3; however, you will
notice an additional parameter. The second parameter specifies a 1,000-millisecond timeout.
If a connection is not made within 1,000 milliseconds (one second), the connection attempt
is aborted and an exception is thrown.
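The timeout behavior described above can be sketched with the standard URLConnection class. This is a minimal illustration and not the book's downloadPage listing: the class name TimeoutDownloadSketch, the parameter name timeoutMillis, and the choice of UTF-8 decoding are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class TimeoutDownloadSketch {
    // Hypothetical stand-in for downloadPage; not the Chapter 3 code.
    static String downloadPage(URL url, int timeoutMillis) throws IOException {
        URLConnection connection = url.openConnection();
        // If no connection is established within timeoutMillis, the connect
        // attempt fails with SocketTimeoutException, which is an IOException.
        connection.setConnectTimeout(timeoutMillis);

        StringBuilder result = new StringBuilder();
        // Assumption: the page is decoded as UTF-8. A fuller version would
        // read the charset from the Content-Type header.
        try (Reader reader = new InputStreamReader(
                connection.getInputStream(), StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int size;
            while ((size = reader.read(buffer)) != -1) {
                result.append(buffer, 0, size);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Usage: java TimeoutDownloadSketch <url>");
            return;
        }
        // 1,000 milliseconds, matching the timeout discussed in the text.
        System.out.println(downloadPage(URI.create(args[0]).toURL(), 1000));
    }
}
```

Because the timeout surfaces as an IOException, a caller that already catches IOException needs no extra handling for it. Note that setConnectTimeout only limits how long the connection attempt may take; a slow response after connecting would need setReadTimeout as well.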
Next, the extractNoCase function is called to extract the text between the
<title> and </title> tags. The extractNoCase function is a special version of the
extract function introduced in Chapter 3 that ignores the case of the tags. For
example, <title> and <Title> would be considered the same. If no title is found,
then the site is listed as “[Untitled site]”.
String title = extractNoCase(page, "<title>", "</title>", 0);
if (title == null)
  title = "[Untitled site]";
return title;
If an exception occurs, then a value of null is returned, indicating that a site could not
be found.
} catch (IOException e)
{
  return null;
}
This recipe makes use of extractNoCase, which is a new version of the extract
function. Both of these functions can be seen in Listing 4.2. They are slight
modifications of the functions introduced in Chapter 3; see that chapter for more
information.
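To make the case-insensitive matching concrete, here is a minimal sketch of how such a function could work. It is not the code from Listing 4.2. The class name ExtractSketch and the helper indexOfNoCase are hypothetical, and the fourth argument is assumed to be the index at which the search starts, since this section does not describe it.

```java
class ExtractSketch {
    // Finds token in str at or after index from, ignoring case.
    // Returns -1 when there is no match.
    private static int indexOfNoCase(String str, String token, int from) {
        for (int i = Math.max(from, 0); i + token.length() <= str.length(); i++) {
            if (str.regionMatches(true, i, token, 0, token.length())) {
                return i;
            }
        }
        return -1;
    }

    // Hypothetical version of extractNoCase. Assumption: start is the
    // position in str where the search for token1 begins.
    static String extractNoCase(String str, String token1, String token2, int start) {
        int begin = indexOfNoCase(str, token1, start);
        if (begin == -1) {
            return null;
        }
        begin += token1.length();
        int end = indexOfNoCase(str, token2, begin);
        if (end == -1) {
            return null;
        }
        // Indexes refer to the original string, so its case is preserved.
        return str.substring(begin, end);
    }

    public static void main(String[] args) {
        String page = "<html><TITLE>My Site</Title></html>";
        String title = extractNoCase(page, "<title>", "</title>", 0);
        System.out.println(title); // prints "My Site"

        String missing = extractNoCase("<html></html>", "<title>", "</title>", 0);
        System.out.println(missing == null ? "[Untitled site]" : missing); // prints "[Untitled site]"
    }
}
```

The sketch compares with regionMatches rather than lowercasing the whole page, so the returned title keeps its original capitalization.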
Recipe #4.3: Download Binary or Text
Downloading a file from a URL is a common task for a bot. However, different procedures
must be followed depending on the type of file being downloaded. If the file is binary, such as
an image, then an exact copy of the file must be made on the local computer. If the file is text,
then the line breaks must be properly formatted for the current operating system.
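The difference between the two procedures can be sketched as follows. This is an illustration only, not the recipe's code; the class name CopySketch and the method names copyBinary and copyText are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

class CopySketch {
    // Binary: copy every byte unchanged so the local file is an exact copy.
    static void copyBinary(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int size;
        while ((size = in.read(buffer)) != -1) {
            out.write(buffer, 0, size);
        }
    }

    // Text: read line by line and write each line with the line separator
    // of the current operating system. readLine accepts \n, \r, and \r\n.
    // Note: this always ends the output with a line separator, even if the
    // source text did not end with one.
    static void copyText(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            out.write(line);
            out.write(System.lineSeparator());
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter text = new StringWriter();
        copyText(new StringReader("first\r\nsecond\nthird"), text);
        System.out.print(text);
    }
}
```

Running a binary file through copyText would corrupt it, because bytes that happen to look like line breaks would be rewritten; this is why the file type has to be known before the download starts.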
Chapter 3 introduced two recipes for downloading files from a URL. One version would
download a text file; the other would download a binary file. As you saw earlier in this
chapter, the content-type header tells what type of file will be downloaded. Recipe 4.3
contains a more sophisticated URL downloader than the ones in Chapter 3. It first determines
the type of file and then downloads it in the appropriate way. Listing 4.3 shows this new
URL downloader.
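Before the full listing, the type check described above can be sketched in isolation. This is not taken from Listing 4.3: the class name ContentTypeSketch and the method isTextType are hypothetical, and treating every content type that begins with "text/" as text is an assumption of this example.

```java
import java.util.Locale;

class ContentTypeSketch {
    // Assumption: any content type starting with "text/" (for example
    // "text/html; charset=UTF-8") is text; everything else, including a
    // missing header, is treated as binary.
    static boolean isTextType(String contentType) {
        return contentType != null
                && contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/");
    }

    public static void main(String[] args) {
        // In a real download the value would come from
        // URLConnection.getContentType(), which may return null.
        System.out.println(isTextType("text/html; charset=UTF-8")); // prints "true"
        System.out.println(isTextType("image/png"));                // prints "false"
    }
}
```

Defaulting to binary when the header is absent is the safer choice, since an exact byte copy never damages a file, while line-break conversion can.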
Listing 4.3: Download Text or Binary (DownloadURL.java)
package com.heatonresearch.httprecipes.ch4.recipe3;
import java.net.*;
import java.io.*;