Understanding a Digital Object: Basic Representation Information - Advanced Digital Preservation


First, the octets are arranged in big-endian format, where the most significant octet is octet 0, which is read first on big-endian systems. Bit 0 of octet 0 represents the decimal integer value 2^31 = 2,147,483,648 and is the most significant bit. Bit 7 of octet 3 represents the decimal integer value 2^0 = 1 and is the least significant bit (in terms of its contribution to the decimal integer value). With little-endian, the least significant octet is read first and the most significant octet is read last.
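The bit weights described above can be checked with a short Python sketch. It assumes an unsigned 32-bit integer and the numbering used in the text (bit 0 is the most significant bit of each octet); the helper name `bit_weight` is hypothetical and introduced only for illustration.

```python
# The value 2**31 has only its most significant bit set.
value = 2 ** 31

# Serialise the same value in both octet orders.
big = value.to_bytes(4, byteorder="big")        # octet 0 holds the set bit
little = value.to_bytes(4, byteorder="little")  # octet 3 holds the set bit

print(big.hex())     # 80000000
print(little.hex())  # 00000080


def bit_weight(octet_index, bit_index):
    """Weight of one bit in a 32-bit big-endian unsigned integer.

    Assumes bit 0 is the most significant bit of each octet,
    matching the numbering used in the text.
    """
    position = 8 * octet_index + bit_index  # 0 is the most significant bit
    return 2 ** (31 - position)


print(bit_weight(0, 0))  # 2147483648
print(bit_weight(3, 7))  # 1
```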

Every hardware computer system manipulates PDTs in one or more of the endian formats. Reading little-endian data on a big-endian system without swapping the octets will give incorrect results for the DVs, which is why endianness is a fundamental property of the PDTs. Swapping the octets is a simple procedure of reordering them: in this case, converting from big-endian to little-endian would involve moving octet 3 to appear first (reading left to right), then octet 2, octet 1 and finally octet 0. Note that it is not simply reversing the order of the bits!
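As a minimal sketch of the difference between swapping octets and reversing bits, the following Python example uses the standard `struct` module. The sample value 0x01020304 and the helper name `reverse_bits` are assumptions chosen for illustration only.

```python
import struct

# A 32-bit unsigned integer written in little-endian octet order.
raw_little = struct.pack("<I", 0x01020304)
print(raw_little.hex())  # 04030201

# Reading those octets as big-endian without swapping gives the wrong value.
wrong = struct.unpack(">I", raw_little)[0]
print(hex(wrong))  # 0x4030201

# Swapping the octets: octet 3 moves first, then octet 2, octet 1, octet 0.
swapped = raw_little[::-1]
right = struct.unpack(">I", swapped)[0]
print(hex(right))  # 0x1020304


def reverse_bits(octets):
    """Reverse the order of all bits (NOT what an endian swap does)."""
    width = 8 * len(octets)
    as_int = int.from_bytes(octets, "big")
    reversed_bits = format(as_int, "0{}b".format(width))[::-1]
    return int(reversed_bits, 2).to_bytes(len(octets), "big")


# Bit reversal scrambles the contents of each octet as well as their order.
print(reverse_bits(raw_little).hex())  # 8040c020
```

The octet swap restores the original value, whereas the bit reversal produces an unrelated octet sequence.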


7.3.1.2 Characters

Characters are digital representations of the basic symbols in human written language. Typically a character is not stored as the glyph of a written character (such as an alphabetic character) but rather as a code (code point), which a character encoding associates with the corresponding glyph or with some other representation.
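The distinction between a code and its glyph can be illustrated with a small Python sketch. Python strings hold Unicode code points, so `ord` and `chr` convert between a character and its numeric code; the sample characters are chosen here only for illustration.

```python
import unicodedata

# Each character is stored as a numeric code, not as a drawn shape.
for ch in ["A", "a", "\t"]:
    code_point = ord(ch)
    # Control characters such as Tab have no formal name in the Unicode
    # database, so a fallback string is supplied for them.
    name = unicodedata.name(ch, "<unnamed control character>")
    print(code_point, name)

# Going the other way: from the code back to the character.
print(chr(65))  # A
```

How the character is finally drawn depends on the font used to render it, which is separate from the stored code.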

One of the most common character encodings is ASCII [28]. ASCII is represented as seven bits, giving 128 possible character codes. Not all the ASCII characters are printable; some represent control symbols, such as Tab or Carriage Return, which are used for formatting text. ASCII was extended to use full octets with the development of ISO/IEC 8859, giving a wider set of 256 possible code values. ISO/IEC 8859 [29] is split over 15 parts, of which the first, ISO/IEC 8859-1, is Latin alphabet No. 1. Each part encodes a different set of characters, so a given code value (161, say) can correspond to different characters depending on which part is used. Typically, a file containing text encoded with, say, ISO/IEC 8859-1 would not be interpreted correctly if decoded with ISO/IEC 8859-2, even though both are text files with eight-bit characters. The encoding standard used for a text file is thus very important representation information.
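The effect of decoding with the wrong part of ISO/IEC 8859 can be sketched in Python using its built-in codecs. The octet value 161 and the sample word are assumptions chosen for illustration.

```python
# One octet, two different characters depending on the part used.
octet = bytes([161])
as_latin1 = octet.decode("iso8859-1")  # inverted exclamation mark
as_latin2 = octet.decode("iso8859-2")  # capital A with ogonek
print(repr(as_latin1), repr(as_latin2))

# Text written with ISO/IEC 8859-2 but read back as ISO/IEC 8859-1.
text = "\u0141\u00f3d\u017a"  # the Polish city name Lodz, with diacritics
stored = text.encode("iso8859-2")
garbled = stored.decode("iso8859-1")
print(text, "->", garbled)

# The octets are identical; only the interpretation differs.
print(stored == garbled.encode("iso8859-1"))  # True
```

Nothing in the octets themselves records which part was used, which is why the encoding has to be kept as representation information alongside the file.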

More recently, a new set of standards has been developed to represent character encodings; these standards are called Unicode [30]. Unicode comes with several character encodings, for example UTF-8, UTF-16 and UTF-32. UTF-8 is intended to be backwards compatible with ASCII, in that it needs only one octet to encode each of the 128 ASCII characters, using the same values as ASCII.
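A brief Python sketch of this compatibility, and of how the three encodings differ in size, is given below. The sample characters are assumptions chosen for illustration; the big-endian codec variants are used so that no byte order mark is added to the output.

```python
# ASCII text produces exactly the same octets under UTF-8.
ascii_text = "Tab"
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))  # True


def encoded_sizes(ch):
    """Number of octets needed for one character in each encoding."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )


# (UTF-8, UTF-16, UTF-32) sizes in octets.
print(encoded_sizes("A"))       # (1, 2, 4)
print(encoded_sizes("\u00e9"))  # (2, 2, 4)  e with acute accent
print(encoded_sizes("\u20ac"))  # (3, 2, 4)  euro sign
```

UTF-8 grows with the code point value, while UTF-32 always uses four octets per character.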

Unicode supports far more characters than just ASCII; it in fact tries to encode the characters of all languages in common use (Basic Multilingual Plane) and even historical scripts such as Egyptian Hieroglyphs. This means that it requires more
