Understanding a Digital Object: Basic Representation Information - Advanced Digital Preservation

7.3.1.9 Strings

Strings are simply one-dimensional arrays of characters. They can be mixed with

other PDTs in binary data or they can exist on their own, usually in text files. The

most important basic characteristic is that of the character PDT used in the string

(ASCII [ 28 ], UTF-8 [ 35 ], etc.).
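To illustrate why the character type matters, the following is a minimal Python sketch. The sample text and variable names are assumptions chosen for illustration; it shows that the same characters occupy different numbers of bytes depending on the encoding, and that decoding with the wrong character type either fails or garbles the text.

```python
# Illustrative only: the sample text and variable names are assumptions.
text = "café"

# In UTF-8, 'c', 'a' and 'f' take one byte each, while 'é' takes two.
utf8_bytes = text.encode("utf-8")
char_count = len(text)
byte_count = len(utf8_bytes)

# Interpreting the bytes as ASCII fails, because ASCII has no byte values above 127.
try:
    utf8_bytes.decode("ascii")
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False

# Interpreting the bytes as Latin-1 succeeds but yields the wrong characters.
latin1_view = utf8_bytes.decode("latin-1")

print(char_count, byte_count, decoded_as_ascii, latin1_view)
```

Without a record of the character type, a reader of the bytes cannot tell which of these interpretations was intended.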

Strings can be structured or unstructured. When a string is unstructured there

are only two additional properties that characterise the string structure. The

first is the length in characters of the string and the second is the range of

allowed characters (“A”-“Z” say) that can appear in the string, though this is

optional.
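The two properties of an unstructured string can be sketched as a small Python description object. The class name `UnstructuredString`, its fields and the four-letter example are hypothetical; this is one possible way to record the length and the optional character range, not a prescribed format.

```python
import string
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnstructuredString:
    """Hypothetical description of an unstructured string."""
    length: int                    # length of the string in characters
    allowed: Optional[str] = None  # optional set of allowed characters

    def conforms(self, value: str) -> bool:
        """Check a value against the recorded length and character range."""
        if len(value) != self.length:
            return False
        if self.allowed is None:
            return True  # no character restriction was recorded
        return all(ch in self.allowed for ch in value)


# Example: a four-character string restricted to "A"-"Z".
code_field = UnstructuredString(length=4, allowed=string.ascii_uppercase)
print(code_field.conforms("ABCD"))
print(code_field.conforms("AB1D"))
```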

When a string is structured it contains a known set of sub-strings

each of which may or may not contain a limited set of characters. The most common

way of defining the structure of strings is using a variant of the Backus Naur

Form (BNF) [ 36 ]. Extended Backus Naur Form (EBNF), ISO/IEC 14977 [ 37 ], is a

standardised version of BNF.

Most text file formats, for example XML [ 38 ], define their own variant of

BNF. BNF is used as a guide to producing parsers for a text file format; as published, it is

generally not machine processable and is not used directly to generate parser code

automatically. Usually a parser generator library is used to map the BNF/EBNF grammar

to source code, which involves hand-crafting code using the grammar as a guide.

Tools such as Yet Another Compiler Compiler (Yacc) [ 39 ] and the Java Compiler

Compiler (JavaCC) [ 40 ] can help in creating the parser. They are called compiler

compilers because they are used extensively in generating compilers for programming

languages. The source files for programming languages are usually text files

where the allowed syntax (string structures) is defined in some form of BNF; see,

for example, the C language standard [ 41 ].
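The idea of hand-crafting a parser with a grammar as a guide can be sketched as follows. The date grammar, shown as EBNF-style comments, and all function names are invented for illustration and are not taken from any of the cited standards. Each rule of the grammar is mapped by hand onto a small Python function.

```python
# Hypothetical grammar used as the guide (EBNF-style notation):
#   date  = year, "-", month, "-", day ;
#   year  = digit, digit, digit, digit ;
#   month = digit, digit ;
#   day   = digit, digit ;
#   digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;

DIGITS = "0123456789"


class ParseError(ValueError):
    """Raised when the input does not follow the grammar."""


def parse_digits(text, pos, count):
    """Consume exactly `count` digit characters starting at `pos`."""
    chunk = text[pos:pos + count]
    if len(chunk) != count or any(ch not in DIGITS for ch in chunk):
        raise ParseError(f"expected {count} digits at position {pos}")
    return chunk, pos + count


def expect(text, pos, literal):
    """Consume a literal terminal such as "-"."""
    if not text.startswith(literal, pos):
        raise ParseError(f"expected {literal!r} at position {pos}")
    return pos + len(literal)


def parse_date(text):
    """Hand-written counterpart of the `date` rule."""
    year, pos = parse_digits(text, 0, 4)
    pos = expect(text, pos, "-")
    month, pos = parse_digits(text, pos, 2)
    pos = expect(text, pos, "-")
    day, pos = parse_digits(text, pos, 2)
    if pos != len(text):
        raise ParseError("unexpected trailing characters")
    return {"year": year, "month": month, "day": day}


print(parse_date("2011-03-07"))
```

The sketch only checks syntax; whether the sub-strings form a valid calendar date is a separate, semantic question.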

BNF is not the only way of defining the structure of a string. Regular expressions

can also be used. Regular expressions can be thought of in terms of pattern matching

where a given regular expression matches a particular string structure. For example,

the regular expression

'structure' | 'semantics'

matches the string 'structure' OR 'semantics', where the “|” symbol stands for OR.
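As a minimal sketch, the same alternation can be tried with Python's standard `re` module. The choice of Python and the sample sentence are assumptions for illustration. In this library's syntax literal text is written without quotes, so the pattern becomes `structure|semantics`.

```python
import re

# Alternation: match either the literal "structure" or the literal "semantics".
pattern = re.compile(r"structure|semantics")

# Sample input invented for this example.
text = "Both structure and semantics must be described."

# Collect each match together with its start and end position in the string.
matches = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]
print(matches)  # [('structure', 5, 14), ('semantics', 19, 28)]
```

The positions are zero-based character offsets, with the end offset pointing one past the last matched character.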

One advantage of regular expressions over BNF is that the regular expression can

be used directly with software APIs that handle them. The Perl language [ 42 ], for

example, has its own regular expression library that takes a specific form of regular

expression, applies this to a string and outputs the locations in the string of the

matching cases. Other languages such as Java also have their own built-in regular

expression libraries. The main disadvantage of regular expressions is the variability

of their syntax (usually not the same for all libraries that support them). The Portable

Operating System Interface (POSIX) [ 43 ] does define a standard regular expression

syntax which is implemented on many UNIX systems. Another disadvantage

is that the expressions themselves can increase considerably in complexity as the
