handles the byte-to-character translation), looking for the delimiter sequence, and returning
the character string preceding it.
Unfortunately, the Reader classes do not support reading binary data. Moreover, the
relationship between the number of bytes read from the underlying InputStream and the
number of characters read from the Reader is unspecified, especially with multibyte encodings.
When a message uses a combination of the two framing methods mentioned above, with some
explicit-length-delimited fields and others using character markers, this can create problems.
The class Framer, defined below, allows an InputStream to be parsed as a sequence of
fields delimited by specific byte patterns. The static method Framer.nextToken() reads bytes
from the given InputStream until it encounters the given sequence of bytes or the stream ends.
All bytes read up to that point are then returned in a new byte array. If the end of the stream is
encountered before any data is read, null is returned. The delimiter can be different for each
call to nextToken(), and the method is completely independent of any encoding.
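To make the behavior described above concrete, here is a minimal sketch of such a method. It is
an illustration written for this passage, not the Framer listing itself: the class name
FramerSketch, the helper endsWith(), and the sample message in main() are all assumptions. The
sketch also assumes that an empty delimiter never matches, so the rest of the stream is returned.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for the Framer class described in the text.
public class FramerSketch {

    // Reads bytes until the delimiter is seen or the stream ends.
    // Returns the bytes preceding the delimiter, or null if the stream
    // was already at its end when the call began.
    public static byte[] nextToken(InputStream in, byte[] delimiter) throws IOException {
        int next = in.read();
        if (next == -1) {
            return null; // end of stream before any data was read
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        do {
            buffer.write(next);
            // Copying the buffer after every byte is simple but slow.
            byte[] soFar = buffer.toByteArray();
            if (endsWith(soFar, delimiter)) {
                return Arrays.copyOf(soFar, soFar.length - delimiter.length);
            }
        } while ((next = in.read()) != -1);
        return buffer.toByteArray(); // stream ended without a delimiter
    }

    // Assumption: an empty suffix never matches.
    private static boolean endsWith(byte[] data, byte[] suffix) {
        if (suffix.length == 0 || data.length < suffix.length) {
            return false;
        }
        int offset = data.length - suffix.length;
        for (int i = 0; i < suffix.length; i++) {
            if (data[offset + i] != suffix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample message with two different delimiters.
        byte[] message = "name=Ann\r\nid=7;rest".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new ByteArrayInputStream(message);

        byte[] first = nextToken(in, new byte[] {'\r', '\n'});
        byte[] second = nextToken(in, new byte[] {';'});
        byte[] third = nextToken(in, new byte[] {';'});
        byte[] fourth = nextToken(in, new byte[] {';'});

        System.out.println(new String(first, StandardCharsets.US_ASCII));  // name=Ann
        System.out.println(new String(second, StandardCharsets.US_ASCII)); // id=7
        System.out.println(new String(third, StandardCharsets.US_ASCII));  // rest
        System.out.println(fourth == null);                                // true
    }
}
```

The delimiter is passed on every call, so a single stream can mix fields that end in different
byte patterns, as the sample message does.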
A couple of words of caution are in order here. First, nextToken() is terribly inefficient;
for real applications, a more efficient pattern-matching algorithm should be used. Second,
when using Framer.nextToken() with text-based message formats, the caller must convert the
delimiter from a character string to a byte array and the returned byte array to a character
string. In this case the character encoding needs to distribute over concatenation, so that it
doesn't matter whether a string is converted to bytes all at once, or a little bit at a time.
To make this precise, let E( ) represent an encoding, that is, a function that maps
character sequences to byte sequences. Let a and b be sequences of characters, so E(a)
denotes the sequence of bytes that is the result of encoding a. Let “+” denote concatenation
of sequences, so a + b is the sequence consisting of a followed by b. This explicit-conversion
approach (as opposed to parsing the message as a character stream) should only be used with
encodings that have the property that E(a + b) = E(a) + E(b); otherwise, the results may be
unexpected. Although most encodings supported in Java have this property, some do not.
In particular, UnicodeBig and UnicodeLittle encode a String by first outputting a byte-order
indicator (the 2-byte sequence 254-255 for big-endian, and 255-254 for little-endian), followed
by the 16-bit Unicode value of each character in the String, in the indicated byte order. Thus,
the encoding of “Big fox” using UnicodeBig is as follows (byte values in decimal):

    [mark]   'B'      'i'      'g'      ' '      'f'      'o'      'x'
    254 255  0 66     0 105    0 103    0 32     0 102    0 111    0 120
while the encoding of “Big” concatenated with the encoding of “ fox” (note the leading space),
using the same encoding, is as follows:

    [mark]   'B'      'i'      'g'      [mark]   ' '      'f'      'o'      'x'
    254 255  0 66     0 105    0 103    254 255  0 32     0 102    0 111    0 120
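The difference between the two byte sequences can be checked directly. The sketch below is an
illustration only: the class name EncodingConcatCheck and the method distributes() are
hypothetical, and it uses StandardCharsets.UTF_16 as a stand-in for UnicodeBig, on the
assumption that the two behave alike here. (Java's UTF-16 charset writes a big-endian
byte-order mark when encoding.) Checking one pair of strings can expose a failure of the
property, but it does not prove that the property holds in general.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper that tests E(a + b) = E(a) + E(b) for one pair of strings.
public class EncodingConcatCheck {

    // Returns true if encoding a + b in one step yields the same bytes as
    // encoding a and b separately and concatenating the results.
    public static boolean distributes(Charset encoding, String a, String b) {
        byte[] whole = (a + b).getBytes(encoding);
        byte[] left = a.getBytes(encoding);
        byte[] right = b.getBytes(encoding);

        // Build E(a) + E(b).
        byte[] joined = Arrays.copyOf(left, left.length + right.length);
        System.arraycopy(right, 0, joined, left.length, right.length);

        return Arrays.equals(whole, joined);
    }

    public static void main(String[] args) {
        String a = "Big";
        String b = " fox";

        // UTF-8 and UTF-16BE write no byte-order mark.
        System.out.println(distributes(StandardCharsets.UTF_8, a, b));    // true
        System.out.println(distributes(StandardCharsets.UTF_16BE, a, b)); // true

        // UTF-16 prepends a byte-order mark to every encoded string,
        // so the separately encoded pieces carry one mark each.
        System.out.println(distributes(StandardCharsets.UTF_16, a, b));   // false
    }
}
```

For delimiter-based framing of text, a marker-free encoding such as UTF-16BE or UTF-8 avoids
the extra bytes shown above.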
Using either of these encodings to convert the delimiter results in a byte sequence that
begins with the byte-order marker. Moreover, if the byte array returned by nextToken() does not
begin with one of the markers, any attempt to convert it to a String using one of these encodings