Java Reference
tp://en.wikipedia.org/wiki/Unicode). Unlike XML 1.0, XML 1.1 is not
widely implemented and should be used only by those needing its unique features.
XML supports Unicode, which means that XML documents consist entirely of
characters taken from the Unicode character set. The document's characters are
encoded into bytes for storage or transmission, and the encoding is specified via
the XML declaration's optional encoding attribute. One common encoding is UTF-8
(see http://en.wikipedia.org/wiki/UTF-8), which is a variable-length encoding of
the Unicode character set. UTF-8 is a strict superset of ASCII (see
http://en.wikipedia.org/wiki/Ascii), which means that pure ASCII text files are
also UTF-8 documents.
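The two properties just described (variable length, and ASCII compatibility) can be observed with a short Java sketch. The class and method names here (Utf8Demo, utf8Length, sameInAsciiAndUtf8) are made up for illustration; only String.getBytes() and StandardCharsets come from the standard library. Non-ASCII characters are written as Unicode escapes so that the source file itself stays pure ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Demo {
    // Returns the number of bytes needed to store the text in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Returns true when the text encodes to identical bytes in US-ASCII and UTF-8.
    // Characters outside ASCII are replaced by '?' in the US-ASCII encoding,
    // so any non-ASCII text makes the two byte arrays differ.
    static boolean sameInAsciiAndUtf8(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));       // ASCII letter
        System.out.println(utf8Length("\u00e7"));  // c with cedilla
        System.out.println(utf8Length("\u20ac"));  // euro sign
        System.out.println(sameInAsciiAndUtf8("<movie/>"));
    }
}
```

The three lengths differ because UTF-8 uses between one and four bytes per character, depending on the character's code point.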
Note In the absence of the XML declaration, or when the XML declaration's
encoding attribute is not present, an XML parser typically looks for a special
character sequence at the start of a document to determine the document's
encoding. This character sequence is known as the byte-order mark (BOM), and is
created by an editor program (such as Microsoft Windows Notepad) when it saves the
document according to UTF-8 or some other encoding. For example, the hexadecimal
sequence EF BB BF signifies UTF-8 as the encoding. Similarly, FE FF signifies
UTF-16 big endian (see http://en.wikipedia.org/wiki/UTF-16/UCS-2), FF FE
signifies UTF-16 little endian, 00 00 FE FF signifies UTF-32 big endian (see
http://en.wikipedia.org/wiki/UTF-16/UCS-2), and FF FE 00 00 signifies UTF-32
little endian. UTF-8 is assumed if no BOM is present.
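The BOM rules in this note can be sketched as a small lookup over a document's first few bytes. This is only an illustration of the table above, not how any particular parser is implemented: BomSniffer, detect, and startsWith are hypothetical names, and the method returns encoding names as plain strings. A real XML parser typically does more, such as examining the bytes of the XML declaration itself when no BOM is present.

```java
class BomSniffer {
    // Returns the name of the encoding signalled by a leading BOM,
    // or "UTF-8" when no BOM is found (the default described in the note).
    static String detect(byte[] head) {
        // Test the 4-byte UTF-32 marks first: FF FE 00 00 starts with
        // the 2-byte UTF-16 little-endian mark FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return "UTF-8";
    }

    // Returns true when data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x3C};
        System.out.println(detect(head));
    }
}
```

The order of the checks matters: testing for FF FE before FF FE 00 00 would misreport every UTF-32 little-endian document as UTF-16 little endian.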
If you'll never use characters apart from the ASCII character set, you can probably
forget about the encoding attribute. However, if your native language isn't English,
or if you are called upon to create XML documents that include non-ASCII characters,
you need to properly specify encoding. For example, if your document contains
ASCII plus characters from a non-English Western European language (such as ç, the
c with cedilla used in French, Portuguese, and other languages), you might want to
choose ISO-8859-1 as the encoding attribute's value. The document will probably be
smaller when encoded in this manner than when encoded with UTF-8, which needs two
bytes for each such character. Listing 10-2 shows you the resulting XML declaration.
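The size difference between the two encodings can be checked with a short Java sketch. EncodingSizeDemo and encodedLength are hypothetical names chosen for this illustration, and the sample string is the movie title from Listing 10-2, with its accented letter written as a Unicode escape so that the source file stays pure ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {
    // Returns how many bytes the text occupies in the given encoding.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // The title contains one non-ASCII character (e with acute accent).
        String title = "Le Fabuleux Destin d'Am\u00e9lie Poulain";
        int latin1 = encodedLength(title, StandardCharsets.ISO_8859_1);
        int utf8 = encodedLength(title, StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1: " + latin1 + " bytes");
        System.out.println("UTF-8:      " + utf8 + " bytes");
    }
}
```

The saving is one byte for each accented character, so it is small for text that is mostly ASCII. Note also that ISO-8859-1 can represent only 256 characters, whereas UTF-8 covers all of Unicode.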
Listing 10-2. An encoded document containing non-ASCII characters
<?xml version="1.0" encoding="ISO-8859-1" ?>
<movie>
<name>Le Fabuleux Destin d'Amélie Poulain</name>