Understanding a Digital Object: Basic Representation Information - Advanced Digital Preservation


than one octet to encode one character. UTF-8 allows a sequence of up to four octets
to represent one character, which makes for quite a complex encoding mechanism
(described in the Unicode standard). UTF-16 encodes each character as one or two
16-bit (two-octet) units, and the byte order within each unit is significant. The
byte order of text encoded in UTF-16 is usually indicated by a Byte Order Mark (BOM)
at the start of the text. The BOM is the value FEFF (hexadecimal notation): it
appears as the byte sequence FE FF when the text is encoded in big-endian byte order,
or as FF FE when the text is encoded in little-endian byte order. FEFF also
represents the “zero-width no-break space” character, i.e. a character that does not
display anything or have any other effect, and FFFE is guaranteed not to represent
any character.
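
As an illustration of the BOM convention, the following minimal Python sketch inspects the first two octets of a byte string. The function name detect_utf16_byte_order and the sample text "Hi" are hypothetical, chosen only for this example; the BOM constants come from Python's standard codecs module.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from a leading Byte Order Mark, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):  # octets FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # octets FF FE
        return "little-endian"
    return "unknown"  # no BOM: the byte order must be known by other means


# Hypothetical sample text, encoded both ways with an explicit BOM.
text = "Hi"
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(big.hex(" "))     # fe ff 00 48 00 69
print(little.hex(" "))  # ff fe 48 00 69 00
print(detect_utf16_byte_order(big))     # big-endian
print(detect_utf16_byte_order(little))  # little-endian

# UTF-8 uses between one and four octets per character.
utf8_lengths = [len(ch.encode("utf-8")) for ch in ("A", "\u00e9", "\u20ac", "\U0001F600")]
print(utf8_lengths)  # [1, 2, 3, 4]
```

Without a BOM the function returns "unknown"; in that case the byte order would have to be recorded elsewhere, for instance in the representation information that accompanies the data.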

One can conclude that a character is a sequence of bits (a bit pattern) that can,
when encountered in data, be represented in a more meaningful form such as a glyph,
or in some other representation such as a decimal value. This implies that a
character type could in fact be more formally described by representing the whole
character set as an enumeration. The exact nature of the decoding from code to its
representation is data- or even domain-specific.
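
The idea of describing a character type as an enumeration can be sketched in a few lines of Python. The three-member AsciiUpper set and the decode helper are hypothetical, invented for this example; a real character set would enumerate every code it defines.

```python
from enum import Enum


class AsciiUpper(Enum):
    """Hypothetical character set with three members, keyed by their 8-bit code."""
    A = 0x41
    B = 0x42
    C = 0x43


def decode(octets: bytes) -> str:
    """Map each octet to the name of the matching enumeration member."""
    # AsciiUpper(code) raises ValueError for a code outside the enumeration.
    return "".join(AsciiUpper(code).name for code in octets)


print(decode(bytes([0x43, 0x41, 0x42])))  # CAB
print(AsciiUpper.A.value)                 # 65, the same code shown as a decimal value
```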

7.3.1.3 Integers

Integers come in a variety of flavours, differing in the number of bits composing the
integer and in the range of numbers the integer can represent. Typically there are
8, 16, 32, 64 and 128 or more bits in integer types. In Fig. 7.5, the big-endian
4-octet integer (32 bits) can be read as an unsigned integer with values ranging
from 0 to 4,294,967,295. The exact value of the big-endian integer in Fig. 7.5 is
2,736,100,710. If it were read as little-endian without swapping the octets, the
value would read 1,721,046,435; if the octets were swapped first, one would still
get the correct value of 2,736,100,710.
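
The effect of byte order on these values can be checked with a short Python sketch. The octets A3 15 95 66 are an assumption: they are derived from the stated value 2,736,100,710 rather than copied from Fig. 7.5, which is not reproduced here.

```python
# Assumed octets, derived from the unsigned big-endian value 2,736,100,710.
raw = bytes([0xA3, 0x15, 0x95, 0x66])

# Read the same four octets with each byte order, as an unsigned integer.
big_unsigned = int.from_bytes(raw, byteorder="big", signed=False)
little_unsigned = int.from_bytes(raw, byteorder="little", signed=False)

# Swapping the octets before a little-endian read recovers the original value.
swapped_then_little = int.from_bytes(raw[::-1], byteorder="little", signed=False)

print(big_unsigned)         # 2736100710
print(little_unsigned)      # 1721046435
print(swapped_then_little)  # 2736100710
print(2**32 - 1)            # 4294967295, the largest unsigned 32-bit value
```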

Integers can also be signed. Usually the most significant bit is the sign bit
(although it can be located elsewhere in the octets), zero for positive and one for
negative. The rest of the bits are used to represent the value of the number.

In Fig. 7.5 the big-endian value as a signed integer is -1,558,866,586. We must of
course state how we calculated the decimal value of the integer. In the above signed
integer example we have used the two's complement interpretation of the bits. In
two's complement the most significant bit is the sign bit. When the sign bit is one,
the number is negative and its magnitude is found by inverting all the bits (zero
goes to one, one goes to zero) and then adding one; this gives a binary
representation that can be read in the normal way. There are other ways of
interpreting integers, such as sign-and-magnitude and one's complement. This method
of interpretation is a fundamental property of digital integers.
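
To make the difference between these interpretations concrete, here is a minimal Python sketch that decodes one 32-bit pattern under each scheme. The function decode_signed and its scheme names are hypothetical, and the bit pattern 0xA3159566 is an assumption derived from the unsigned value 2,736,100,710 quoted for Fig. 7.5. The sign bit is assumed to be the most significant bit.

```python
def decode_signed(bits: int, width: int, scheme: str) -> int:
    """Interpret an unsigned bit pattern as a signed integer under a given scheme."""
    all_ones = (1 << width) - 1         # mask covering every bit
    sign_bit = bits >> (width - 1)      # most significant bit
    if sign_bit == 0:
        return bits                     # non-negative values read the same way
    if scheme == "twos-complement":
        return -(((~bits) & all_ones) + 1)   # invert all bits, add one, negate
    if scheme == "ones-complement":
        return -((~bits) & all_ones)         # invert all bits, negate
    if scheme == "sign-and-magnitude":
        return -(bits & (all_ones >> 1))     # drop the sign bit, negate
    raise ValueError(f"unknown scheme: {scheme}")


pattern = 0xA3159566  # assumed 32-bit pattern (unsigned value 2,736,100,710)

print(decode_signed(pattern, 32, "twos-complement"))     # -1558866586
print(decode_signed(pattern, 32, "ones-complement"))     # -1558866585
print(decode_signed(pattern, 32, "sign-and-magnitude"))  # -588617062

# Python's built-in signed decoding uses two's complement and agrees.
print(int.from_bytes(pattern.to_bytes(4, "big"), "big", signed=True))  # -1558866586
```

The same 32 bits yield three different numbers, which is why the interpretation scheme has to be recorded alongside the data.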

Integers then have three properties: the octet (byte) order, the location of the sign
bit and finally the way in which the bits should be interpreted (two's complement
etc.). Integers can also be restricted in data value, i.e. they can have a minimum,
a maximum (or both) or a fixed value. For example, the EISCAT Matlab 4 format [31]
