HTML images are embedded with the <img> tag. This tag has an attribute, named src,
that contains the URL of the image to be displayed. A typical HTML image tag looks like
this:
<img src="/images/logo.gif" width="320" height="200"
alt="Company Logo">
The only attribute that this recipe is concerned with is the src attribute. The other
attributes are optional and may or may not be present.
Extracting Images
The method opens a stream to the page and loops across every tag and text character in the HTML file.
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
When an HTML tag is found (in the code below, read returns 0 at a tag), it is checked to
see if it is an <img> tag. If it is, the src attribute is read to determine the path to the image.
int ch;
while ((ch = parse.read()) != -1)
{
  if (ch == 0)
  {
    HTMLTag tag = parse.getTag();
    if (tag.getName().equalsIgnoreCase("img"))
    {
      String src = tag.getAttributeValue("src");
To download the image we need the fully qualified URL. For example, the <img>
tag's src attribute may contain the value /images/logo.gif, but what we need is
http://www.heatonresearch.com/images/logo.gif. To obtain this URL, we pass the
page's URL and the src value to the URL class as follows:
URL u = new URL(url, src);
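To illustrate how this two-argument constructor behaves, here is a minimal standalone sketch. The class name ResolveImageUrl, the resolve helper, and the sample page address are assumptions made for this example and are not part of the recipe.

```java
import java.net.MalformedURLException;
import java.net.URL;

class ResolveImageUrl {
    /**
     * Resolves an img src value against the URL of the page it came from.
     * Hypothetical helper for illustration; the recipe does this inline.
     */
    @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
    static String resolve(String pageUrl, String src) {
        try {
            URL base = new URL(pageUrl);
            // The two-argument constructor treats src as relative to base
            return new URL(base, src).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve " + src, e);
        }
    }

    public static void main(String[] args) {
        String page = "http://www.heatonresearch.com/articles/index.html";
        // Root-relative path: replaces the whole path of the page URL
        System.out.println(resolve(page, "/images/logo.gif"));
        // prints http://www.heatonresearch.com/images/logo.gif

        // Document-relative path: resolved against the page's directory
        System.out.println(resolve(page, "pics/photo.jpg"));
        // prints http://www.heatonresearch.com/articles/pics/photo.jpg

        // Already absolute: the base is ignored
        System.out.println(resolve(page, "http://example.com/banner.png"));
        // prints http://example.com/banner.png
    }
}
```

The same constructor therefore covers root-relative, document-relative, and absolute src values, so the loop does not need to tell them apart.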
Next we extract the filename from the URL and combine it with a local directory to
form the path the file will be saved to. The downloadBinaryPage method then downloads
the image. This method was covered in Chapter 3.
String filename = extractFile(u);
File saveFile = new File(saveTo, filename);
this.downloadBinaryPage(u, saveFile);
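The extractFile method is not shown in this section. The following is a minimal sketch of one way such a method might work, assuming it simply returns whatever follows the last slash in the URL's path. The class name ExtractFileSketch, the String overload, and the downloads directory are hypothetical.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

class ExtractFileSketch {
    /**
     * Hypothetical stand-in for the recipe's extractFile method.
     * Returns the text after the last '/' in the URL's path.
     * The result is empty when the path ends with '/' or is missing.
     */
    static String extractFile(URL u) {
        String path = u.getPath(); // path only, without any query string
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Convenience overload that accepts the URL as text. */
    @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
    static String extractFile(String url) {
        try {
            return extractFile(new URL(url));
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        String filename = extractFile("http://www.heatonresearch.com/images/logo.gif");
        // "downloads" is an assumed local directory standing in for saveTo
        File saveFile = new File("downloads", filename);
        System.out.println(filename);
        System.out.println(saveFile.getPath());
    }
}
```

Using getPath rather than the full URL text keeps a query string such as ?v=2 out of the saved filename. Two images with the same name in different directories would still collide, which a real implementation may need to handle.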
The loop repeats these steps for every image on the page.
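The ParseHTML and HTMLTag classes are not reproduced in this section, so the overall flow can only be sketched here with a stand-in. The example below swaps in a regular expression for the tag parser, which is cruder than the recipe's approach: it handles only quoted src values and is not a general HTML parser. The class name ImageLinkSketch, the imageUrls method, and the sample markup are assumptions.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageLinkSketch {
    // Matches <img ... src="value"> or src='value', ignoring case.
    // Group 2 holds the attribute value. Unquoted values are not handled.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the fully qualified URL of every quoted img src found in html. */
    @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
    static List<String> imageUrls(String pageUrl, String html) {
        URL base;
        try {
            base = new URL(pageUrl);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad page URL: " + pageUrl, e);
        }
        List<String> result = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            try {
                // Same resolution step as the recipe: new URL(url, src)
                result.add(new URL(base, m.group(2)).toString());
            } catch (MalformedURLException e) {
                // Skip src values that cannot be resolved
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/about.html\">About</a>"
                + "<img src=\"/images/logo.gif\" width=\"320\" height=\"200\"\n"
                + "alt=\"Company Logo\">"
                + "<IMG SRC='pics/photo.jpg'></p>";
        for (String u : imageUrls("http://www.heatonresearch.com/index.html", html)) {
            System.out.println(u);
        }
    }
}
```

Each URL returned here corresponds to one pass through the recipe's loop body, where the filename would be extracted and the image downloaded.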