}
}
And here are the first few lines of output when SourceViewer downloads
http://www.oreilly.com:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>oreilly.com -- Welcome to O'Reilly Media, Inc. -- computer books,
software conferences, online publishing</title>
<meta name="keywords" content="O'Reilly, oreilly, computer books, technical
books, UNIX, unix, Perl, Java, Linux, Internet, Web, C, C++, Windows, Windows
NT, Security, Sys Admin, System Administration, Oracle, PL/SQL, online books,
books online, computer book online, e-books, ebooks, Perl Conference, Open Source
Conference, Java Conference, open source, free software, XML, Mac OS X, .Net, dot
net, C#, PHP, CGI, VB, VB Script, Java Script, javascript, Windows 2000, XP,
There are quite a few more lines in that web page; if you want to see them, you can fire
up your web browser.
The shakiest part of this program is that it blithely assumes that the URL points to text,
which is not necessarily true. It could well be pointing to a GIF or JPEG image, an MP3
sound file, or something else entirely. Even if it does resolve to text, the document
encoding may not be the same as the default encoding of the client system. The remote
host and local client may not have the same default character set. As a general rule, for
pages that use a character set radically different from ASCII, the HTML will include a
META tag in the header specifying the character set in use. For instance, this META tag
specifies the Big-5 encoding for Chinese:
<meta http-equiv="Content-Type" content="text/html; charset=big5">
An XML document will likely have an XML declaration instead:
<?xml version="1.0" encoding="Big5"?>
In practice, there's no easy way to get at this information other than by parsing the file
and looking for a header like this one, and even that approach is limited. Many HTML
files handcoded in Latin alphabets don't have such a META tag. Since Windows, Mac, and
most Unixes have somewhat different interpretations of the characters from 128 to 255,
the extended characters in these documents do not translate correctly on platforms
other than the one on which they were created.
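As a rough illustration of "parsing the file and looking for a header," the sketch below scans the first bytes of a document for either an XML declaration or a META tag that names a charset. The class name DeclaredCharsetSniffer and its sniff() method are hypothetical helpers, not part of SourceViewer, and the regular expressions are a simplification: they assume the declaration appears near the start of the file and that the encoding is ASCII-compatible (so they would not find a declaration in a UTF-16 document).

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of SourceViewer): guesses the encoding a
// document declares for itself by scanning its first bytes.
class DeclaredCharsetSniffer {

  // Matches charset=... inside a META tag,
  // e.g. content="text/html; charset=big5"
  private static final Pattern META_CHARSET = Pattern.compile(
      "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
      Pattern.CASE_INSENSITIVE);

  // Matches encoding="..." inside an XML declaration
  private static final Pattern XML_ENCODING = Pattern.compile(
      "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)[\"']",
      Pattern.CASE_INSENSITIVE);

  // Returns the declared charset name, or null if no declaration is found.
  static String sniff(byte[] head) {
    // ISO-8859-1 maps every byte to exactly one char, so this decoding
    // never fails and leaves the ASCII markup intact for matching.
    String text = new String(head, StandardCharsets.ISO_8859_1);

    Matcher xml = XML_ENCODING.matcher(text);
    if (xml.find()) return xml.group(1);

    Matcher meta = META_CHARSET.matcher(text);
    if (meta.find()) return meta.group(1);

    return null; // nothing declared; the caller must fall back to a default
  }
}
```

A caller might read the first kilobyte or so of the stream into a byte array, pass it to sniff(), and fall back to a default encoding when the result is null. A null result is common, for the reasons given above.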
As if this weren't confusing enough, the HTTP header that precedes the actual document
is likely to have its own encoding information, which may completely contradict
what the document itself says. You can't read this header using the URL class, but you
can with the URLConnection object returned by the openConnection() method.
Encoding detection and declaration is one of the thornier parts of the architecture of the
Web.
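The following is a minimal sketch of reading that header through URLConnection. The class name HeaderCharset and the helper methods charsetOf() and isText() are hypothetical names introduced here for illustration. getContentType() is the standard URLConnection method that returns the value of the Content-Type header field, or null if it is not known. The parameter parsing is deliberately simple and does not handle every form the HTTP specification allows.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Locale;

// Hypothetical helper: inspects the Content-Type header of a URL.
class HeaderCharset {

  // Extracts the charset parameter from a Content-Type value such as
  // "text/html; charset=big5". Returns null if no charset is given.
  static String charsetOf(String contentType) {
    if (contentType == null) return null;
    String[] parts = contentType.split(";");
    // parts[0] is the MIME type; any parameters follow it
    for (int i = 1; i < parts.length; i++) {
      String param = parts[i].trim();
      if (param.regionMatches(true, 0, "charset=", 0, 8)) {
        String value = param.substring(8).trim();
        // Strip optional surrounding double quotes
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
          value = value.substring(1, value.length() - 1);
        }
        return value.isEmpty() ? null : value;
      }
    }
    return null;
  }

  // True if the header claims a text/* MIME type.
  static boolean isText(String contentType) {
    return contentType != null
        && contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/");
  }

  public static void main(String[] args) throws IOException {
    if (args.length == 0) {
      System.err.println("Usage: java HeaderCharset <url>");
      return;
    }
    URL u = new URL(args[0]);
    URLConnection uc = u.openConnection();
    String contentType = uc.getContentType(); // may be null
    System.out.println("Content-Type: " + contentType);
    System.out.println("Looks like text: " + isText(contentType));
    System.out.println("Declared charset: " + charsetOf(contentType));
  }
}
```

Checking isText() before reading the body is one way to avoid blithely treating an image or sound file as text. When the header's charset and the document's own declaration disagree, the program still has to choose which one to trust.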