}
}
And here are the first few lines of output when SourceViewer downloads http://www.oreilly.com:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>oreilly.com -- Welcome to O'Reilly Media, Inc. -- computer books,
software conferences, online publishing</title>
<meta name="keywords" content="O'Reilly, oreilly, computer books, technical
books, UNIX, unix, Perl, Java, Linux, Internet, Web, C, C++, Windows, Windows NT,
Security, Sys Admin, System Administration, Oracle, PL/SQL, online books, books
online, computer book online, e-books, ebooks, Perl Conference, Open Source
Conference, Java Conference, open source, free software, XML, Mac OS X, .Net,
dot net, C#, PHP, CGI, VB, VB Script, Java Script, javascript, Windows 2000, XP,
There are quite a few more lines in that web page; if you want to see them, you can fire
up your web browser.
The shakiest part of this program is that it blithely assumes that the URL points to text,
which is not necessarily true. It could well be pointing to a GIF or JPEG image, an MP3
sound file, or something else entirely. Even if it does resolve to text, the document
encoding may not be the same as the default encoding of the client system. The remote
host and local client may not have the same default character set. As a general rule, for
pages that use a character set radically different from ASCII, the HTML will include a
META tag in the header specifying the character set in use. For instance, this META tag
specifies the Big-5 encoding for Chinese:
<meta http-equiv="Content-Type" content="text/html; charset=big5">
An XML document will likely have an XML declaration instead:
<?xml version="1.0" encoding="Big5"?>
In practice, there's no easy way to get at this information other than by parsing the file
and looking for a header like this one, and even that approach is limited. Many HTML
files handcoded in Latin alphabets don't have such a META tag. Since Windows, Mac, and
most Unixes have somewhat different interpretations of the characters from 128 to 255,
the extended characters in these documents do not translate correctly on platforms
other than the one on which they were created.
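As a rough illustration of "parsing the file and looking for a header," the sketch below scans the first bytes of a document for an XML declaration or a META charset. The class name CharsetSniffer and the method sniff are hypothetical names introduced here, not part of SourceViewer or the Java library. It assumes the declaration itself is written in ASCII-compatible bytes, so it will not find a declaration in a UTF-16 document, and it returns null when nothing is declared.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: guesses a document's declared encoding from its first bytes.
class CharsetSniffer {

    // Matches the encoding pseudo-attribute of an XML declaration,
    // e.g. <?xml version="1.0" encoding="Big5"?>
    private static final Pattern XML_DECL = Pattern.compile(
        "<\\?xml[^>]*?encoding\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Matches charset=... anywhere inside a META tag,
    // e.g. <meta http-equiv="Content-Type" content="text/html; charset=big5">
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([\\w.:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if none is found.
    static String sniff(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so decoding never fails;
        // this only works because the declarations themselves are assumed to be ASCII.
        String text = new String(head, StandardCharsets.ISO_8859_1);

        Matcher xml = XML_DECL.matcher(text);
        if (xml.find()) {
            return xml.group(1);
        }
        Matcher meta = META_CHARSET.matcher(text);
        if (meta.find()) {
            return meta.group(1);
        }
        return null; // no declaration; the caller must fall back to a default
    }

    public static void main(String[] args) {
        byte[] sample = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=big5\">"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(sample));
    }
}
```

A caller would typically read only the first kilobyte or so of the stream, pass it to sniff, and then decode the whole document with the reported charset. A null result is common for the handcoded Latin-alphabet pages described above, which is exactly the limitation the text points out.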
And as if this weren't confusing enough, the HTTP header that precedes the actual
document is likely to have its own encoding information, which may completely contradict
what the document itself says. You can't read this header using the URL class, but you
can with the URLConnection object returned by the openConnection() method.
Encoding detection and declaration is one of the thornier parts of the architecture of the
Web.
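To make the HTTP header case concrete, here is a minimal sketch that asks a URLConnection for its Content-Type header and pulls out the charset parameter, if any. The class name HeaderCharset and the methods charsetFrom and declaredCharset are hypothetical names chosen for this illustration. The parsing is deliberately simple: it splits on semicolons and does not handle every quoting form a header may legally use.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.util.Locale;

// Hypothetical helper: reads the charset a server declares in its Content-Type header.
class HeaderCharset {

    // Extracts the charset parameter from a header value such as
    // "text/html; charset=big5". Returns null if there is none.
    static String charsetFrom(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the MIME type itself; parameters start at index 1
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip one pair of surrounding double quotes, if present
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Opens a connection and reports the charset from the response header.
    static String declaredCharset(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        // getContentType() returns null when the server sends no Content-Type
        return charsetFrom(connection.getContentType());
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            URL url = URI.create(args[0]).toURL();
            System.out.println(declaredCharset(url));
        }
    }
}
```

The same Content-Type value also tells you whether the resource is text at all: a value beginning with something like image/ or audio/ signals that the content should not be decoded as characters. When the header and the document disagree about the charset, the program still has to pick one; this sketch only reports what the header says.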