EXTRACTING DATA - HTTP Programming Recipes for Java Bots

HTML images are stored in the <img> tag. This tag contains an attribute, named src,

that holds the URL of the image to be displayed. A typical HTML image tag looks like

this:

<img src="/images/logo.gif" width="320" height="200"

alt="Company Logo">

The only attribute that this recipe is concerned with is the src attribute. The other

attributes are optional and may or may not be present.
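
To make the role of the src attribute concrete, the following is a minimal, self-contained sketch that pulls src values out of a string of HTML. It deliberately uses a regular expression from java.util.regex rather than the ParseHTML class used by this recipe, so it is only a simplified stand-in and will miss unquoted attribute values and other edge cases. The class name ImgSrcSketch and the method findImageSources are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcSketch {
    // Matches an <img ...> tag and captures a single- or double-quoted src value.
    // This is a simplification for illustration, not a full HTML parser.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1[^>]*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the src value of every <img> tag found, in document order.
    static List<String> findImageSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(2)); // group 2 is the text between the quotes
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<img src=\"/images/logo.gif\" width=\"320\" height=\"200\"\n"
                + "alt=\"Company Logo\">";
        System.out.println(findImageSources(html)); // prints [/images/logo.gif]
    }
}
```

The width, height, and alt attributes are simply skipped, which mirrors the point above that only src matters for downloading the image.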

Extracting Images

The method loops over every tag and text character in the HTML file.

InputStream is = url.openStream();

ParseHTML parse = new ParseHTML(is);

When read returns 0, an HTML tag has been found; it is then checked to see whether it is an

<img> tag. If it is, the src attribute is analyzed to determine the path to the image.

int ch;
while ((ch = parse.read()) != -1)
{
  if (ch == 0)
  {
    HTMLTag tag = parse.getTag();
    if (tag.getName().equalsIgnoreCase("img"))
    {
      String src = tag.getAttributeValue("src");

To download the image we need the fully qualified URL. For example, the <img>

tag's src attribute may contain the value /images/logo.gif, but what we need is

http://www.heatonresearch.com/images/logo.gif. To obtain this URL, we use

the URL class as follows:

URL u = new URL(url, src);
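
The behavior of the two-argument URL constructor can be seen in isolation with a short sketch. The page address and the class and method names (ResolveUrlSketch, resolve) are assumptions made for illustration; only the new URL(context, spec) call itself comes from the recipe. Note that the URL constructors are marked deprecated in recent JDKs, although they remain available.

```java
import java.net.MalformedURLException;
import java.net.URL;

@SuppressWarnings("deprecation") // URL constructors are deprecated in newer JDKs but still work
class ResolveUrlSketch {
    // Resolves src against the address of the page it appeared on.
    // Wraps the checked exception so callers can use it in simple expressions.
    static String resolve(String pageUrl, String src) {
        try {
            URL page = new URL(pageUrl);
            return new URL(page, src).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve " + src, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page address, used only for this example.
        String page = "http://www.heatonresearch.com/articles/index.html";

        // A path starting with "/" is resolved against the site root.
        System.out.println(resolve(page, "/images/logo.gif"));
        // prints http://www.heatonresearch.com/images/logo.gif

        // A path without a leading "/" is resolved against the page's directory.
        System.out.println(resolve(page, "logo.gif"));
        // prints http://www.heatonresearch.com/articles/logo.gif

        // An absolute URL is returned unchanged.
        System.out.println(resolve(page, "http://example.com/a.png"));
        // prints http://example.com/a.png
    }
}
```

Because all three forms of src go through the same call, the extraction loop does not need to test whether a given src value is relative or absolute.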

Next we extract the filename from the URL and append it to the local path where the

file will be saved. The downloadBinaryPage method will download the image. This

method was covered in Chapter 3.

String filename = extractFile(u);

File saveFile = new File(saveTo, filename);

this.downloadBinaryPage(u, saveFile);
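
The body of extractFile is not shown in this section. The sketch below is one plausible way such a helper could work, assuming the filename is simply whatever follows the last slash in the URL's path; it should not be read as the recipe's actual implementation. The class name ExtractFileSketch and the downloads directory are likewise hypothetical.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

@SuppressWarnings("deprecation") // URL constructors are deprecated in newer JDKs but still work
class ExtractFileSketch {
    // Hypothetical stand-in for the recipe's extractFile helper:
    // returns the part of the URL path after the last '/'.
    // The result is an empty string when the path ends in '/'.
    static String extractFile(URL u) {
        String path = u.getPath(); // the path only, without any query string
        int i = path.lastIndexOf('/');
        return (i == -1) ? path : path.substring(i + 1);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL u = new URL("http://www.heatonresearch.com/images/logo.gif");
        String filename = extractFile(u);

        // Combine the filename with a local directory, as the recipe does with saveTo.
        File saveTo = new File("downloads"); // hypothetical target directory
        File saveFile = new File(saveTo, filename);

        System.out.println(filename);
        System.out.println(saveFile.getPath());
    }
}
```

A fuller version would also need to decide what to do when two images on the page share a filename, or when the path ends in a slash and no filename can be extracted.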

The loop repeats these steps for every image on the page.
