HTML images are embedded with the <img> tag. This tag has an attribute named src that contains the URL of the image to be displayed. A typical HTML image tag looks like this:
<img src="/images/logo.gif" width="320" height="200"
alt="Company Logo">
The only attribute that this recipe is concerned with is the src attribute. The other attributes are optional and may or may not be present.
Extracting Images
The method loops across every tag and text character in the HTML file. It begins by opening a stream to the URL and wrapping it in a ParseHTML object.
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
When an HTML tag is found, it is checked to see if it is an <img> tag. If the tag is an image, the src attribute is read to determine the path to the image.
int ch;
while ((ch = parse.read()) != -1)
{
  // a value of 0 signals that a tag, not a text character, was read
  if (ch == 0)
  {
    HTMLTag tag = parse.getTag();
    if (tag.getName().equalsIgnoreCase("img"))
    {
      String src = tag.getAttributeValue("src");
To download the image, we need the fully qualified URL. For example, the <img> tag's src attribute may contain the value /images/logo.gif, but what we need is http://www.heatonresearch.com/images/logo.gif. To obtain this URL, we use the URL class as follows:
URL u = new URL(url, src);
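To make the resolution step concrete, here is a minimal standalone sketch of the same two-argument URL constructor. The class name ResolveImageUrl, the resolve helper, and the page URLs used in main are assumptions for illustration; they are not part of the recipe.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ResolveImageUrl {
    // Resolve a src value (absolute or relative) against the URL of the page
    // that contained it. Wraps the checked exception for convenience.
    static String resolve(String pageUrl, String src) {
        try {
            URL base = new URL(pageUrl);
            return new URL(base, src).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page URLs, chosen only to show the three common cases.
        String root = "http://www.heatonresearch.com/index.html";
        String nested = "http://www.heatonresearch.com/articles/index.html";

        // Site-relative path: replaces the whole path of the page URL.
        System.out.println(resolve(root, "/images/logo.gif"));
        // Document-relative path: resolved against the page's directory.
        System.out.println(resolve(nested, "images/logo.gif"));
        // Already absolute: returned unchanged.
        System.out.println(resolve(root, "http://www.heatonresearch.com/images/logo.gif"));
    }
}
```

The first and third calls both yield http://www.heatonresearch.com/images/logo.gif, while the second yields http://www.heatonresearch.com/articles/images/logo.gif, which is why the page's own URL must be passed as the context.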
Next, we extract the filename from the URL and append it to the local path where the file will be saved. The downloadBinaryPage method then downloads the image. This method was covered in Chapter 3.
String filename = extractFile(u);
File saveFile = new File(saveTo, filename);
this.downloadBinaryPage(u, saveFile);
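The bodies of extractFile and downloadBinaryPage are not shown in this section, so the following is only a sketch of what such helpers might look like. Both methods here are hypothetical stand-ins: extractFile is assumed to return everything after the last slash of the URL path, and downloadBinary simply copies the raw bytes to disk. The demo uses a local temporary file as the source so that it runs without network access.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class SaveImageSketch {
    // Hypothetical stand-in for extractFile: the part of the URL path after
    // the last '/'. Returns an empty string if the path ends with '/'.
    static String extractFile(URL u) {
        String path = u.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Hypothetical stand-in for downloadBinaryPage: copy the bytes behind the
    // URL to saveFile without any text decoding, overwriting an existing file.
    static void downloadBinary(URL u, File saveFile) throws IOException {
        try (InputStream in = u.openStream()) {
            Files.copy(in, saveFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // A local temp file plays the role of the remote image.
        File source = File.createTempFile("logo", ".gif");
        Files.write(source.toPath(), new byte[] {'G', 'I', 'F'});
        URL u = source.toURI().toURL();

        // saveTo is the local directory that receives the downloaded images.
        File saveTo = Files.createTempDirectory("images").toFile();
        File saveFile = new File(saveTo, extractFile(u));
        downloadBinary(u, saveFile);
        System.out.println("Saved " + saveFile.length() + " bytes to " + saveFile);
    }
}
```

Copying bytes rather than characters matters for images, since any character-set conversion would corrupt the file.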
The loop repeats these steps for every <img> tag on the page.
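For readers who want to experiment without the book's ParseHTML class, the overall flow can be approximated in a self-contained way. The sketch below swaps in a regular expression for the tag parser, which is a deliberate simplification and not the recipe's technique: it will miss edge cases that a real HTML parser handles. The class name ImageUrlCollector and the sample HTML are assumptions.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageUrlCollector {
    // Matches an <img> tag up to its src attribute; the value may be
    // double-quoted (group 1), single-quoted (group 2), or unquoted (group 3).
    private static final Pattern IMG_SRC = Pattern.compile(
        "<img\\b[^>]*?\\ssrc\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    // Return the fully qualified URL of every image referenced in html.
    static List<String> collect(String pageUrl, String html) {
        URL base;
        try {
            base = new URL(pageUrl);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad page URL: " + pageUrl, e);
        }
        List<String> result = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            String src = m.group(1) != null ? m.group(1)
                       : m.group(2) != null ? m.group(2)
                       : m.group(3);
            try {
                // Same resolution step as in the recipe: page URL plus src.
                result.add(new URL(base, src).toString());
            } catch (MalformedURLException e) {
                // Skip src values that cannot be turned into a URL.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample markup, made up for this demo.
        String html = "<p>Hi</p>"
            + "<img src=\"/images/logo.gif\" width=\"320\" height=\"200\" alt=\"Company Logo\">"
            + "<IMG SRC='photos/a.jpg'>";
        for (String u : collect("http://www.heatonresearch.com/index.html", html)) {
            System.out.println(u);
        }
    }
}
```

Each URL in the returned list could then be handed to a download step like the one the recipe performs with downloadBinaryPage.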