7.3.1.9 Strings
Strings are simply one-dimensional arrays of characters. They can be mixed with
other PDTs in binary data or they can exist on their own, usually in text files. The
most important basic characteristic is the character PDT used in the string
(ASCII [ 28 ], UTF-8 [ 35 ], etc.).
Strings can be structured or unstructured. When a string is unstructured there
are only two additional properties that characterise the string structure. The
first is the length in characters of the string and the second is the range of
allowed characters (“A”-“Z” say) that can appear in the string, though this is
optional.
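The two properties of an unstructured string can be checked in a few lines. The following is a minimal Python sketch, not a definitive implementation: the function name `is_valid_unstructured` and its parameters are hypothetical, and it assumes that "length" means an exact character count and that the allowed range is supplied as a set of characters.

```python
import string


def is_valid_unstructured(text, length, allowed=None):
    """Check an unstructured string against its two properties.

    length  -- required number of characters (assumed to be exact)
    allowed -- optional set of permitted characters; None means any
    """
    if len(text) != length:
        return False
    if allowed is None:
        # The allowed-character range is optional.
        return True
    return all(ch in allowed for ch in text)


# Hypothetical allowed range "A"-"Z".
UPPER_CASE = set(string.ascii_uppercase)

if __name__ == "__main__":
    print(is_valid_unstructured("ABC", 3, UPPER_CASE))
    print(is_valid_unstructured("AbC", 3, UPPER_CASE))
```

If the range is omitted, only the length is checked, which mirrors the optional nature of the second property.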
When a string is structured, it contains a known set of sub-strings,
each of which may or may not be restricted to a limited set of characters. The most
common way of defining the structure of strings is to use a variant of the Backus Naur
Form (BNF) [ 36 ]. Extended Backus Naur Form (EBNF), ISO/IEC 14977 [ 37 ], is a
standardised version of BNF.
Most text file formats, for example XML [ 38 ], use their own variants of
BNF. BNF is used as a guide to producing parsers for a text file format. As written
in such specifications, it is not directly machine processable and has not been used
to generate parser code automatically. Usually a parser generator library is used to
map the BNF/EBNF grammar to the source code, which involves hand-crafting code using
the grammar as a guide.
Tools such as Yet Another Compiler Compiler (Yacc) [ 39 ] and the Java Compiler
Compiler (JavaCC) [ 40 ] can help in creating the parser. They are called compiler
compilers because they are used extensively in generating compilers for programming
languages. The source files for programming languages are usually text files
where the allowed syntax (string structure) is defined in some form of BNF; see,
for example, the C language standard [ 41 ].
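To illustrate what hand-crafting a parser with a grammar as a guide can look like, here is a minimal Python sketch. The grammar is a hypothetical one invented for this example (it is not taken from any of the cited formats), and the function name `parse_pair` is likewise an assumption. Each rule of the grammar, shown in the comments in EBNF style, is mapped by hand onto a few lines of code.

```python
# Hypothetical EBNF grammar used as the guide:
#
#   pair  = key, "=", value ;
#   key   = letter, { letter } ;
#   value = digit, { digit } ;
#
# "letter" is assumed to mean an ASCII letter and "digit" an ASCII digit.


def parse_pair(text):
    """Parse a string such as 'width=42' into ('width', 42)."""
    pos = 0

    def take_while(predicate, what):
        # Implements the pattern "x, { x }": one or more matching characters.
        nonlocal pos
        start = pos
        while pos < len(text) and predicate(text[pos]):
            pos += 1
        if pos == start:
            raise ValueError(f"expected {what} at position {start}")
        return text[start:pos]

    # key = letter, { letter }
    key = take_while(lambda c: c.isascii() and c.isalpha(), "letter")

    # the literal "="
    if pos >= len(text) or text[pos] != "=":
        raise ValueError(f"expected '=' at position {pos}")
    pos += 1

    # value = digit, { digit }
    value = take_while(lambda c: c.isascii() and c.isdigit(), "digit")

    # Nothing may follow the pair.
    if pos != len(text):
        raise ValueError(f"unexpected character at position {pos}")
    return key, int(value)


if __name__ == "__main__":
    print(parse_pair("width=42"))
```

Any input that does not follow the grammar, such as a missing value or a trailing character, raises `ValueError` with the offending position.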
BNF is not the only way of defining the structure of a string. Regular expressions
can also be used. Regular expressions can be thought of in terms of pattern matching,
where a given regular expression matches a particular string structure. For example,
the regular expression
'structure'|'semantics'
matches the string 'structure' OR 'semantics', where the “|” symbol stands for OR.
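The alternation above can be tried out with any regular expression library. The sketch below uses Python's standard `re` module as one possible choice; the pattern is written in that library's syntax (without the quotes), and the helper names `matches_whole` and `find_locations` are hypothetical.

```python
import re

# The alternation from the text, in Python's regular expression syntax.
PATTERN = re.compile(r"structure|semantics")


def matches_whole(text):
    """Return True if the entire string is 'structure' or 'semantics'."""
    return PATTERN.fullmatch(text) is not None


def find_locations(text):
    """Return (start, end, matched_text) for every match found in text."""
    return [(m.start(), m.end(), m.group()) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    print(matches_whole("structure"))
    print(matches_whole("syntax"))
    print(find_locations("structure and semantics"))
```

`fullmatch` tests a whole string against the pattern, while `finditer` scans a longer string and reports where each match starts and ends.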
One advantage of regular expressions over BNF is that a regular expression can
be used directly with software APIs that handle them. The Perl language [ 42 ], for
example, has its own regular expression library that takes a specific form of regular
expression, applies it to a string and outputs the locations in the string of the
matching cases. Other languages such as Java also have their own built-in regular
expression libraries. The main disadvantage of regular expressions is the variability
of their syntax (usually not the same for all libraries that support them). The Portable
Operating System Interface (POSIX) [ 43 ] does define a standard regular expression
syntax which is implemented on many UNIX systems. Another disadvantage
is that the expressions themselves can increase considerably in complexity as the