Programming-Language Design (Compiler Writing) Part 3

Data Structure

Four aspects of data structure will be considered in this subsection. First, a discussion of the alternative forms for data declarations is presented. Next, an overview of the variety of data types that are available in programming languages today and that should be considered if a new-language design is undertaken. The effects of storage-allocation strategies on language-design aspects are then described. The final topic deals with a discussion of the scope of variables in a programming language.

The declarations of a language are the means by which nonprocedural information is conveyed to the compiler. While there exists a large number of specialized forms of declarations to meet specific needs, this discussion will focus on the three common requirements of most programming languages: declarations for constants, types, and variables.

It has recently become obvious that it is extremely useful to be able to give names to constants. This avoids the use of "magic numbers" in the body of the code. In this way program readability is greatly improved and it is much easier to change the value of, say, a table size. Some languages (see Clark and Horning, 1971) have even taken the position that, with the exception of 0, 1, and 2, all numeric constants occurring in the code must be given names by constant declarations. Consequently, "magic numbers" may not occur anywhere except in such declarations.

Another aspect of the declaration of constants is that it is useful to be able to write a constant for any data type in the program. For example, it should be possible to write an array constant. Such constants would largely eliminate the need to initialize arrays, as most initialized arrays are really array constants, and would prevent the programmer from accidentally destroying the contents of such a constant.

Similarly, it is useful, essentially as a form of abbreviation, to be able to give names to types and then use those names in the declaration of variables or in the construction of more complex types. Such type declarations can save a great deal of writing while also improving readability.

The syntax of constant and type declarations can be similar, as in the following:

CONST table_size = 437;

TYPE big_array = ARRAY [1 TO 1000, 1 TO 1000] OF integer;

The similarities between the two may bring up a tempting thought: Why not generalize both to a parameterless macrofacility?

The basic reason to avoid this is that constants and types are, by and large, the only things that would be declared. Consequently, such an overgenerajized construct would be an open invitation to "clever" tricky programming.

If the designer decides not to allow initialization of variables, the syntax of variable declarations becomes quite simple. A very readable syntax is that used by PASCAL:

VAR a, b, c: integer;

Note that this syntax is really suitable only when all information about, say, array bounds is localized in the type specification. Readability of this form diminishes rapidly if the list of variable names is cluttered up with other things such as initialization.

Let us now turn to a discussion of data types in programming languages. The basic reference for modern data types is Hoare’s work (Hoare, Dahl, and Dijkstra 1972). Wirth’s work on PASCAL (Jensen and Wirth, 1975) is worth looking at as a practical implementation of many of Hoare’s ideas. Since the material on this topic is heavily based on these two documents, no further mention of these references is made. Hehner (1975) and Tennent (1975) also present some interesting ideas.

The basis of all notions of data structure is the data type. What is a data type? Although there has been considerable difference of opinion on this, the most prevalent view is that a data type is a set of values plus a set of operations operating on these values. For example, "integer" might be the set of values "…,- 2, -1,0,1,2, …" and the set of operations " + ,-,*,/, «- ." Note that the presence of the operations as part of the definition also implies that the members of the set of values cannot be chosen arbitrarily; there must be some consistency among them so that the same operations will apply to all.

There are three distinct approaches to types in programming languages. The first is to have none at all, characteristic of assembly language and some "midlevel" languages. The desirability of types is well established; so no further attention will be given to their absence. The remaining two approaches are called "hard" and "soft" typing. In both approaches a value has a type associated with it. The basic distinction between the two approaches is that in a soft-typed language any variable may contain any value (as in SNOBOL and APL), whereas in a hard-typed language a specific type is associated with every variable and the variable may contain only values belonging to that type’s set of values.

Generally speaking, it is now accepted that while soft typing renders a language slightly smaller and somewhat more concise, hard typing is vastly superior from the viewpoint of compile-time error checking (Wirth, 1974) and introduces no noticeable inconvenience. Further discussion will assume hard typing.

While no two language designers agree on exactly how the broad variety of data types should be subdivided, types can generally be grouped into three categories: simple, compound, and complex. These terms are relative and the boundaries between them are fuzzy.

The simple data types are on the comparatively primitive level of a language; they are used to build more complex types and are usually provided more or less directly by the machine upon which the compiler is implemented.

One matter of general interest in simple types is the question of ordering; that is,-are the values of a type considered to be in a specific sequence? This basically determines whether comparisons such as " < " are valid for all simple types, and it may also influence the counted loop of the language. In some cases, such as numbers, ordering is obviously required. For some other situations, the precise nature of the ordering is less obvious. Specific comments will be made on this point throughout the subsection.

The most elementary kind of simple type is what has been called the enumeration type, in which the programmer simply gives a list of identifiers as the set of values. Examples of this case are:

TYPE color = (red, green, blue)

TYPE job_status = (executing, waiting, terminated)

Note that these sample declarations constitute declarations of the values "red," "green," etc., as well as the identifiers "color" and "job_status." The valid operations on values of such types are equal and not-equal comparisons and assignment. The constants of this type are simply the names given. Obviously, within the machine the values of an enumeration type are represented as small integers, but this is not visible to the programmer; "red" is not a value of type "integer." It is not clear whether enumeration types should be ordered or not; perhaps the programmer should be able to ask for a specific enumeration type to be ordered. Without them, the programmer is frequently driven to simulate them using integers, a process which works but is error-prone. Any language intended for serious software work should have enumeration types.

Probably the most common form of a data type is the number. There is really very little that can be said about floating-poirtt numbers, other than that they are, at this time, machine-dependent in almost all language implementation. The discussion will therefore be confined largely to integers. While the simple type "integer" is straightforward enough, there are two notable modifications that can be made.

The first modification involves the idea of the "subrange," a type which has, as its value set, a subset of the integers. Examples of this modification are as follows:

TYPE cents = 0 TO 99

TYPE day_of_month = 1 TO 31

TYPE inning = 1 TO 9

Subranges allow the compiler to optimize storage allocation, since it is easy to determine exactly how much space a subrange variable needs. Subrange types are a much more readable way of doing this than things like " short_integer." Subranges are also machine-independent, whereas "integer" is not. Finally, and most importantly, subranges provide very valuable redundancy; the compiler can easily generate code to check that the value being assigned into a subrange variable is within the subrange. A certain amount of this checking can even be done at compile time.

While subranges are definitely valuable, one respect in which they are somewhat weak is in formal definition. The close relationship between subranges and integers complicates the process of defining them because the programmer naturally expects them to be usable anywhere integers are. This means that a formal definition must consider subrange values to be automatically converted to integer values whenever necessary. Probably the best solution to this is to say that a subrange type is really "integer" in disguise, and that the only respect in which it is not identical to "integer" is that a subrange variable may store only a subset of the full " integer" value set. It may also be helpful to have subranges of other types than "integer" (e.g., subranges of "character"). It might be best to say that any ordered type may have subranges defined on it.

The other notable modification to "integer" is one that is also applicable to floating-point. The idea, which is originally due to Hoare (1973), is that one should be able to associate units (e.g., kilograms, clock ticks, meters per second) with numeric variables. This is essentially a way of providing more useful redundancy. The compiler could check to see that operations performed on numeric values were indeed valid. For example, adding kilograms to clock ticks is definitely somebody’s error; a value obtained by multiplying meters per second by seconds should be assigned only into a variable containing meters. A full facility for specifying such things would be somewhat complex (consider, for example, conversions from feet to meters, the use of Newtons as a synonym for kilogram-meters per second-squared), but the idea is novel and is supported by Ada.

After the numeric data types, the "character" types are probably the most frequently used. It is unfortunate that current hardware does not support truly variable-length strings well. The result is a multiplicity of approaches to characters:

1. "Character" is a simple type, holding one character. Strings are implemented as arrays of characters.

2. Strings may vary in length but must have a specified maximum length for purposes of storage allocation.

3. Strings may vary arbitrarily in length (efficiency penalties are accepted for the sake of usability).

A fourth approach, the use of fixed-length strings, will not be considered.

The first approach essentially bypasses the problem. The properties of the array are now the key to the usefulness of strings. Unfortunately, arrays are quite ill-suited to the implementation of strings; one can either accept the poor string handling that results, as in PASCAL, or attempt to stretch the properties of the array and generally meet with undesirable results, as in ALGOL 68.

The second approach is possibly the best compromise; unfortunately, it has the usual properties of compromises. It is not as simple as alternative 1 or as usable as alternative 3. If the language must run without extensive run-time support (e.g., for writing an operating system) or must be very efficient, this is probably the method to use.

Environmental or efficiency constraints permitting, the third approach is the best one. The loss in efficiency can often be surprisingly small, and the gain in usability as exhibited in SNOBOL can be sizable.

The last of the simple types is "Boolean," which is infrequently used but quite essential when it does occur. The PL/I approach is to generalize the Boolean variable into a string of bits analogous to a string of characters. Since actual measurements (Alexander, 1972) show that the overwhelming majority of bit strings are actually one bit long, the generalization seems to have been unnecessary. For most languages, Boolean is sufficient.

Although Boolean could be defined as a separate data type, the simplest approach is simply to predefine it, as in PASCAL, by

TYPE boolean = (false, true)

If enumeration types are available, this approach satisfies all requirements and needs no further intricacies inside the compiler.

Now that we have finished with the simple types, the next subject of discussion is the compound types. These are types constructed in fairly simple ways from simpler types.

The most obvious variety of compound type is the array. Two aspects of it deserve comment. First, consideration should be given to extending subscripts to data types other than integers. While general floating-point numbers are obviously unsuitable as subscripts, enumeration types and single characters can quite often be usefully employed as subscripts. The approach taken in PASCAL in fact takes advantage of this. The dimensions of an array are given in terms of the types of its subscripts:

TYPE nonsense = ARRAY [color, 1 TO 99] OF integer

(see the definition of color earlier).

The second issue concerning arrays is the question of dynamic arrays, whose sizes are determined at block entry rather than at compile time. Dynamic arrays add to compile-time complexity and, for this reason, have been left out of some languages. However, they can be very useful in applications where array size is dependent on input data. The increase in complexity is often outweighed by the usefulness of dynamic arrays, and therefore in many cases dynamic arrays should be included in a language.

The other compound data type is the record, sometimes known, confusingly, as the "structure." For many applications it is just as fundamental as the array. The lack of records is probably a major factor in the rejection of ALGOL 60 by the data-processing community.

One fairly useful embellishment of the record is the tagged-variant record. This allows for the frequent situations where the structure of part of the record is dependent on the data in an earlier part of it:


The record has some fixed fields (e.g., "given_names") followed by a tag field which selects which variant of the remainder of the record is in use. The tag field is named "sex" and is of the enumeration type "(male, female)." The tag field is followed by the variants, which in this case are either " married_name" and "maiden_name" or "last_name." With such an arrangement, the compiler can set up run-time checks to ensure that any reference, for example to "maiden_name," is referencing a record whose "sex" field is "female." PASCAL and Ada support such a facility.

COBOL has the interesting notion of being able to use "filler" as a field name in a record. This feature essentially gives the programmer an anonymous field of the given size. This can be useful if a file of records is being processed by several programs, some of which ignore certain fields.

One less than obvious compound data type is a simple form of set. While sets in general are part of a subject covered later under "complex" data types, one restricted form of set deserves mention, since it can be implemented simply and efficiently but is still fairly useful. If the number of possible elements of a set is quite small, it is possible to implement the set by simply allocating a field of bits. One bit is allocated for each possible element. The bit is 1 or 0, to indicate whether or not the element is currently in the set.

With this scheme, set manipulations are done easily by the machine’s logical instructions. For example, a set over an enumeration type

TYPE colorset = SET OF color;

(see definition of "color" earlier) can be especially useful.

The final compound data type that will be considered is the pointer. The pointer is referred to as a compound type as opposed to a simple type because it is most wisely used in conjunction with and not independent of another data type. In fact, it is fairly widely agreed that undisciplined use of pointers is, if anything, even worse than undisciplined use of the GOTO. There are two major programming-language design solutions to this.

The first solution is Hoare’s recursive data types (Hoare, Dahl, and Dijkstra, 1972), which eliminate the explicit use of a pointer altogether. The basic idea behind recursive data types is that instead of having, say, one field of a record point to another record, the second record is conceptually a field of the first. A recursive data type is one in which the name of the type being defined occurs in its own definition. Such a notion has been used in the definition of a list in LISP and in defining a recursive pattern in SNOBOL 4, in particular. Trees can also be defined in such a manner. In actual fact the implementation at the machine level is the same, via a pointer, but recursive data types hide this completely from the programmer. A recursive data type is a relatively new concept, and it is too early yet to tell whether it will prove to be decisively superior to the alternative, namely, retaining the pointer but placing restrictions on it.

The second solution basically depends on what restrictions should be placed on the pointer. There are two restrictions which appear to fill the requirements. The first is to require the pointer to point to an object of a specific type. In other words, one does not declare a variable to be of type "POINTER"; one declares it to be of type, say, "POINTER TO integer." This restriction alone eliminates most of the hazards of pointers.

The second restriction, which has been suggested by Wirth (1974), is that there should be no "address-of" operator; that is, it should be impossible to make a pointer point at a named variable. Pointers should point only into anonymous, dynamically allocated heap storage. This largely prevents the possibly serious confusion arising from referencing the same storage location under several different names.

The final variety of data type is the complex data type, which is one involving nontrivial implementation effort, nontrivial overhead, and a significant probability that a particular implementation will be far from optimal for some users. Examples of some such data types are lists, general sets, trees, stacks, and queues.

While such data types can be implemented using pointers or recursive data types, such implementations are very hard to change if it is discovered that they need tuning (e.g., if the implementation is very fast, it generally turns out that storage space is the real bottleneck). The long-range solution to this appears to be the idea of data abstraction.

The basic notion of data abstraction is that it should be possible to define a "user interface" for such a data type. This interface specifies what the various operations do to values of the type without mentioning how the values or the operations are implemented. The implementation itself then consists of specifying how the values are implemented (in terms of more primitive data types) and how the operations are implemented (a procedure for each operation). The user can then use the data type without knowing or caring about the details of the implementation, and the implementation can be changed without rewriting user programs.

Data abstractions are currently in the early experimental stages, and there is no agreement on the details. For those wishing to investigate further, Horning (1976) is a more detailed introduction to the ideas presented here, Liskov and Zilles (1974) gives an excellent demonstration of the use of data abstractions, and Shaw (1976) and the rest of the SIGPLAN (1976) issue constitute an excellent overview of the subject.

One feature of some languages is the provision for automatic conversions or coercions. An example of a conversion is converting the string "123" into the numeric value 123 if the string is added to a numeric variable. Coercions have been mentioned in Sec. 3-4.2. There are good reasons why such features should not be included in clean languages. They complicate the language excessively, hamper genuine extensibility (e.g., data abstraction), and severely impede readability because of frequent and excessive violation of the principle of least astonishment. This does not preclude the inclusion of data conversions and pointer chasing in new languages; it simply means that such things should always be explicitly requested and should not be supplied behind the programmer’s back.

Explicit conversion facilities may take one of two forms. The simplest form is that of functions which take a value of one type and return a value of another type (e.g., a function "round" taking "real" and returning "integer"). The alternative is the ALGOL 68 approach, which has a special operator that takes an input value and a data type and converts the input value to the given type.

The allocation of storage for variables in a program is, in its details, the business of the compiler and the run-time system. Certain overall policies, however, are part of the language design. There are basically four forms of variable allocation; they will be discussed in turn, with comments on each.

The simplest and oldest form of allocation is static allocation, immortalized in FORTRAN and in ALGOL 60′s OWN facility. It now appears that the major usefulness of static allocation is its use not within procedures but globally. Most cases where one would use ALGOL-60-style OWN can be better organized as a set of static variables with a set of procedures to operate on them. This makes static allocation relevant to data abstraction and modularity (see Sec. 3-5.5).

The most popular form of allocation, pioneered by ALGOL 60, is local, dynamic, or automatic allocation. The usefulness and implementation simplicity of local allocation make it the form of choice for normal variables within procedures.

The third form of allocation, retention, is one that has not been used extensively. This is different from both static and local allocation, in that storage is allocated on entry to a procedure or block, but it is not necessarily freed on exit — it remains allocated and may again become accessible via various mechanisms. It currently appears that the main usefulness of retention is in the implementation of backtracking algorithms, but broader applications may yet be found.

The final form of allocation is another familiar one. the allocation on explicit program request of anonymous storage ("heap" allocation is the ALGOL 68 term), with the program keeping track of it by a pointer. An exactly analogous process goes on, slightly less visibly, when recursive data types are in use. A heap form of allocation is a standard feature of many programming languages (e.g., ALGOL 68 and PL/I). One point relevant to implementation is that while programmers have the freedom to request allocation of heap storage at any time, they probably should not also have the freedom to do their own deallocation. This will most assuredly result in "dangling pointers." Recovering storage that is no longer accessible to the program should be the responsibility of the run-time system, not the programmer.

The final major issue regarding data structures is the scope of names. The original purpose of restricting the scope of names in ALGOL 60 was largely to facilitate storage allocation. Scope and storage allocation are no longer as strongly coupled as they once were. Nevertheless, restrictions on the scope of names are still present to assist in reducing the complexity of programs.

In order to maintain the complexity of programs and segments of programs within human grasp, it is necessary to restrict the interactions between different segments. Accessing a given variable is one of the most important interactions between the different parts of a program, and for this reason scope restrictions have continued in programming languages. Some comments must be made, however, concerning the nature of the scope rules that are used.

One of the most fundamental questions of the organization of scopes is, what is the basic area of scope restrictions? The most common answer is the one originated by ALGOL 60—the BEGIN-END block. This does appear, however, to be an excessively general structure. Multiple nested blocks with variables declared at each level begin to exceed the human mind’s limited capability for handling nesting. The alternative basis for scope restrictions is the procedure. Instead of allocating variables and controlling their accessibility with arbitrary blocks, such controls can be accomplished at procedure entry and exit. This approach has been in use for a long time (having been introduced in FORTRAN) but nevertheless, when coupled with local rather than static allocation, is quite workable, as witnessed by its modern use in PASCAL. Since it is somewhat simpler than BEGIN-END scope control, it can be recommended.

One somewhat more restrictive aspect of scope control is one that has been mentioned several times elsewhere in this section, namely, ensuring that one variable is not accessible by two different names. It has been suggested (Tennent, 1975) that the compiler should, in fact, check for such situations and consider them errors. In general, this is an idea with some merit; in practice, it could become rather difficult when constructs involving pointers are considered. Even if pointers cannot point to named variables, one can still have two pointers pointing to the same piece of anonymous storage. Recursive data types might clear up some of these problems.

Another aspect of scope which is a restriction that many compilers enforce is the requirement that entities (procedures, variables, etc.) must be declared before they are used. For variables and constants this is not too unpleasant. It may be desirable to provide some extensions to permit, for example, recursive data types or data types with pointers that refer to each other.

The biggest problem with the declare-before-use rule is for procedures. It really is desirable to be able to call a procedure which occurs later in the program, even if one does not have the classic situation of mutually recursive procedures. It is a serious nuisance to have to reorder the procedures in a program to suit the compiler rather than the programmer. Some effort in the compiler is probably justified to avoid having to introduce this restriction.

A final point which is a question of scope but is sometimes not recognized as such is the matter of the field names within a record. In some languages, notably COBOL, these names are accessible from the outside without further qualification. In many modern languages (e.g., PASCAL), "a.x" must be written "a.x" even if there is no other "x" in the program.

The choice of approach here is not obvious; the COBOL approach certainly can involve much less writing, but the PASCAL approach probably improves readability and certainly simplifies both the compiler and the language semantics. Certainly, if the COBOL approach is taken, it is vital that the compiler check for possible ambiguity; it must not be possible to have two different interpretations of the same name at the same place by rearranging the declarations.

Next post:

Previous post: