tp://en.wikipedia.org/wiki/Unicode
). Unlike XML 1.0, XML 1.1 is not widely implemented and should be used only by those needing its unique features.
XML supports Unicode, which means that XML documents consist entirely of characters taken from the Unicode character set. The document's characters are encoded into bytes for storage or transmission, and the encoding is specified via the XML declaration's optional encoding attribute. One common encoding is UTF-8 (see http://en.wikipedia.org/wiki/UTF-8), which is a variable-length encoding of the Unicode character set. UTF-8 is a strict superset of ASCII (see http://en.wikipedia.org/wiki/Ascii), which means that pure ASCII text files are also UTF-8 documents.
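The two properties just described (variable-length encoding and ASCII compatibility) can be checked with a few lines of Java. The following is only a minimal sketch; the class name Utf8LengthDemo and its helper methods are hypothetical names chosen for this illustration, and the non-ASCII characters are written as Unicode escapes so that the source file's own encoding does not matter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of any library.
class Utf8LengthDemo {
    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the UTF-8 and US-ASCII encodings of the text are identical,
    // which holds only for pure ASCII text (other characters become '?' in US-ASCII).
    static boolean sameAsAscii(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(utf8, ascii);
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));         // prints 1 (ASCII letter)
        System.out.println(utf8Length("\u00E7"));    // prints 2 (c with cedilla)
        System.out.println(utf8Length("\u20AC"));    // prints 3 (euro sign)
        System.out.println(sameAsAscii("<movie/>")); // prints true
    }
}
```

Characters in the ASCII range occupy one byte each, which is why a pure ASCII file is already a valid UTF-8 document.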
Note
In the absence of the XML declaration, or when the XML declaration's encoding attribute is not present, an XML parser typically looks for a special character sequence at the start of a document to determine the document's encoding. This character sequence is known as the byte-order mark (BOM), and is created by an editor program (such as Microsoft Windows Notepad) when it saves the document according to UTF-8 or some other encoding. For example, the hexadecimal sequence EF BB BF signifies UTF-8 as the encoding. Similarly, FE FF signifies UTF-16 big endian, FF FE signifies UTF-16 little endian, 00 00 FE FF signifies UTF-32 big endian (see http://en.wikipedia.org/wiki/UTF-16/UCS-2), and FF FE 00 00 signifies UTF-32 little endian. UTF-8 is assumed if no BOM is present.
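The BOM lookup described in this note can be sketched in Java as follows. This is an illustration under stated assumptions, not how any particular XML parser is implemented: the class BomSniffer and its methods are hypothetical, the returned strings are plain labels, and real parsers typically apply further heuristics beyond the BOM.

```java
// Hypothetical helper; maps the BOM byte sequences listed in the note to encoding labels.
class BomSniffer {
    // Returns a label for the encoding implied by a leading BOM,
    // or "UTF-8" when no BOM is present (the default stated in the note).
    static String detect(byte[] head) {
        // Four-byte patterns are tested first because FF FE is a prefix of FF FE 00 00.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return "UTF-8";
    }

    // True when data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        System.out.println(detect(sample)); // prints UTF-16LE
    }
}
```

One limitation of this sketch: a UTF-16 little-endian document whose first character after the BOM is U+0000 would be reported as UTF-32LE, although such a character cannot legally appear in an XML document.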
If you'll never use characters apart from the ASCII character set, you can probably forget about the encoding attribute. However, if your native language isn't English, or if you are called upon to create XML documents that include non-ASCII characters, you need to properly specify encoding. For example, if your document contains ASCII plus characters from a non-English Western European language (such as ç, the c with cedilla used in French, Portuguese, and other languages), you might want to choose ISO-8859-1 as the encoding attribute's value: the document will probably have a smaller size when encoded in this manner than when encoded with UTF-8. Listing 10-2 shows you the resulting XML declaration.
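The size difference mentioned above can be measured directly. The following sketch encodes the movie title used in the listing with both encodings; the class EncodingSizeDemo and its method are hypothetical names for this illustration, and the accented letter is written as a Unicode escape so the result does not depend on how the source file itself is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; compares encoded sizes of the same text.
class EncodingSizeDemo {
    // Number of bytes needed to store the text in the given charset.
    static int encodedSize(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String title = "Le Fabuleux Destin d'Am\u00E9lie Poulain";
        int latin1 = encodedSize(title, StandardCharsets.ISO_8859_1);
        int utf8 = encodedSize(title, StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1: " + latin1 + " bytes");
        System.out.println("UTF-8:      " + utf8 + " bytes");
    }
}
```

ISO-8859-1 stores every character it supports in a single byte, whereas UTF-8 needs two bytes for the accented letter, so the saving grows with the number of accented characters in the document.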
Listing 10-2. An encoded document containing non-ASCII characters
<?xml version="1.0" encoding="ISO-8859-1"?>
<movie>
<name>Le Fabuleux Destin d'Amélie Poulain</name>