Programming-Language Design (Compiler Writing) Part 2

Programming Effectiveness

This topic is distinguished from the previous one, usability, by a rather fuzzy boundary. Roughly speaking, this subsection is concerned with the software-engineering aspects of language use rather than the convenience to the individual programmer.

A major concern in software engineering is the recording of the decisions made during program development. The program itself is the best place to record such decisions. The programming language should facilitate a clear statement of the programmer’s intentions. This implies, in particular, that the decision should not be obscured beneath the mound of details required to implement it.

Whenever possible, the language should allow programmers to state their wants and have the compiler do the implementation. Most constructs of high-level languages do this to some extent. In the all-too-frequent cases where the complexity is too great for the compiler or where the question of how to implement the decision is itself a significant decision, the language should allow the implementation details to be cleanly separated from the statement of the original decision. A few languages, such as LIS (Ichbiah et al., 1973) and Ada, have separate specification and implementation sections in their programs. The procedure notion provides this separation for executable code. Ideas as to how to achieve separation for things such as data structures are much less well developed.


In addition to stating decisions clearly, it is also desirable to be able to localize the effect of decision changes. Such changes are inevitable; it should not be necessary to rewrite the entire program when a change is required. This idea leads directly to the notion of abstraction, which is the concept that a programmer may make use of a particular notion (e.g., the data type "list") without needing to know the details of its implementation. This greatly simplifies the use of such predefined program segments, and is a key factor in controlling the complexity of programs. If one adds the restriction that the user not only does not need to know the implementation details of the abstraction but in fact has no way to use such knowledge, the implementation of the abstraction can then be changed without any alterations to programs using it.

It is obviously desirable for a programming language to support abstraction. Existing languages support it to some extent; the procedure again is the basic tool. Unfortunately, more powerful tools are needed to support, say, a "list" abstraction properly. This has been an active area of research (Liskov and Zilles, 1974; Horning, 1976; Shaw, 1976; SIGPLAN, 1976), and developments in it can be expected to be important to language design.
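
As a concrete illustration (not drawn from any of the cited proposals), here is a minimal sketch, in C, of a "list" abstraction hidden behind an opaque type; all the names are illustrative:

    /* list.h -- all that client programs ever see: an opaque type and
       its operations */
    typedef struct list List;    /* incomplete type: clients cannot look inside */
    List *list_create(void);
    void  list_append(List *l, int v);
    int   list_length(const List *l);

    /* list.c -- the hidden representation; it could be changed to a
       linked structure without any alteration to client programs */
    #include <stdlib.h>
    struct list { int *items; int len, cap; };

    List *list_create(void) {
        return calloc(1, sizeof(struct list));
    }

    void list_append(List *l, int v) {
        if (l->len == l->cap) {                  /* grow the array on demand */
            l->cap = l->cap ? 2 * l->cap : 8;
            l->items = realloc(l->items, (size_t)l->cap * sizeof(int));
        }
        l->items[l->len++] = v;
    }

    int list_length(const List *l) {
        return l->len;
    }

Because client code cannot mention the fields of struct list, the representation can later be replaced without any change to the programs that use it.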

The matter of support for abstraction leads also to the question of support for various techniques of program construction, such as top-down programming. There are some problems in providing such support. In particular, while there are many different methods, new ones are still evolving, and no one method is clearly superior to all others. If a language is to be used with a specific programming methodology, definite attention should be given to making the language assist in the use of the method; otherwise, the designer should try to avoid enforcing any particular technique.

The final aspect of programming effectiveness to be discussed is that languages should discourage trickery. Cute, clever programming tricks are highly detrimental to readability and are seldom justified. Although the language cannot make such things impossible, a language which is clear, readable, and straightforward will at least hinder such activities. In particular, the language designer should be aware that certain language features encourage trickery. Such features are generally characterized by having a legitimate purpose but being overly general for that purpose. This leaves the way open for "clever" programmers to try to squeeze as much activity into the feature as possible. The best-known example of this is the operator-priority structure and powerful, complex operators of APL, which lead to the infamous "one-liner" when their capabilities are abused. Another example of bad operator-precedence design is in C (Kernighan and Ritchie, 1978), which has approximately one dozen priority levels.

Compilability

A language must be compilable. Less obviously, it should be as simple as possible to compile, subject to no major reduction in its communicative effectiveness and usability. Compiler writing is already a complex job, and any increase in its complexity may make the task unmanageable in the sense that the finished product will be bigger, more complex, and hence less reliable.

Complexity can be introduced in both the analysis and synthesis phases of the compilation process. Languages which demand a significant amount of context in order to be parsed successfully are difficult for compiler writers whether they are using an ad hoc parsing technique or a more formalized table-driven parsing method [e.g., a simple precedence (Wirth and Weber, 1966), extended precedence (McKeeman, 1966), LR(k) (Knuth, 1965; DeRemer, 1969), or LL(k) (Rosenkrantz and Stearns, 1970) method]. This type of complexity is commonly introduced in languages in which the comma and/or parenthesis are used to fulfill many different roles. For example, a parenthesis may be used to group subexpressions, enclose arguments in a procedure call, enclose parameters in a procedure definition, surround the subscripts of an array reference, and set off logical expressions, ON-unit arguments, and file names—all in one language called PL/I. A parser look-ahead of two or three may be required, thus slowing down the compiler tremendously and adding to the size of the parsing table.

The generation of code can also be made more difficult if overly complex language features are needlessly included—especially if they are rarely used and/or can be simulated by simpler constructs. The infamous ALGOL 60 call-by-name (or call-as-if-by-macro) parameter passing feature exemplifies such a difficult-to-compile construct. It necessitates the generation of a section of object code [often called a "thunk" (Ingerman, 1961)] for each argument. This code must be reevaluated at run time each time the corresponding formal parameter is referenced. Practice has shown that simpler schemes, such as call-by-reference and call-by-value, are sufficiently powerful parameter-passing methods.
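
As an illustration, the following C sketch suggests roughly what the generated code amounts to: the function pointer plays the role of the thunk, and all the names are illustrative.

    #include <stdio.h>

    static int i;                     /* the caller's variables */
    static int a[10];

    /* the thunk generated for the argument "a[i]": it re-evaluates the
       expression at every reference */
    static int *thunk_a_i(void) { return &a[i]; }

    /* the callee: each use of the formal parameter invokes the thunk */
    static void clear(int n, int *(*x)(void)) {
        for (i = 0; i < n; i++)
            *x() = 0;                 /* refers to a different a[i] each time */
    }

    int main(void) {
        clear(10, thunk_a_i);
        printf("%d\n", a[7]);         /* prints 0 */
        return 0;
    }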

In addition to problems in compilation, some thought should be given to debugging aids, measurement tools, and other aids to program testing. Such facilities are quite important to users of a language if programs are to be verified properly. These notions are elaborated on in Sec. 3-5.5, Compiler Structure.

As far as possible, the language design should be completed before compiler implementation begins. Often, however, the difficulty of compiling a language is not fully appreciated until the implementation stage. In this sense, compilability violates the ideal of having distinct design and implementation stages more than any other programming-language design goal.

Efficiency

"More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason—including blind stupidity." (Wulf, 1972a).

Efficiency has been the most overemphasized topic in the history of programming-language development. Is there any justification for producing a program that executes twenty times as fast as the competition—but fails in half the runs? Efficiency is important after, not before, reliability.

Efficiency must be considered in the context of the total environment, remembering that machines are getting cheaper and programmers are getting more expensive. For example, it is generally undesirable to spend a significant effort to improve a program’s efficiency by 10 percent because the savings simply do not justify the investment.

It is also worth remembering that efficiency is not just speed—space, I/O access, and paging patterns are also involved in efficiency. Space inefficiency in particular can be just as important as time inefficiency, and it is becoming more important in light of recent developments in minicomputers and microcomputers.

While it has certainly been overemphasized, efficiency cannot be ignored entirely. The pronouncement "we don’t need to worry about efficiency anymore because the hardware will be so cheap that we needn’t bother" is partly true—but only partly. Differences in efficiency of 10 or 20 percent, or even 30 or 40 percent, can be tolerated. When the difference is a factor of 2 or a factor of 10, the situation is no longer tolerable, for the following reasons (Hoare, 1973):

1. However cheap and fast the hardware is, it is cheaper and faster when running an efficient program.

2. The speed and cost of peripherals are not improving anywhere near as quickly as those of CPUs.

3. Users should not have to sacrifice readability and reliability in order to get their money’s worth out of the improved hardware.

This being the case, it is necessary to determine the main causes of inefficiency. The biggest single cause of inefficiency is a mismatch between the hardware and the language. Sometimes this indicates revisions to the language, usually to drop features that the hardware cannot support efficiently.

Trying to match a language too closely to the hardware can result in a rather bad language. The best example of this is the FORTRAN IV restrictions on array subscripts. These restrictions adapted FORTRAN admirably to the addressing hardware of the IBM 709, but this no longer constitutes an advantage, and the subscript restrictions are certainly among the more irritating features of FORTRAN.

Another degradation of efficiency arises from "booby-trapped" constructs—innocent-looking constructs that execute incredibly inefficiently. This is really a special case of the mismatch problem mentioned earlier, but it deserves explicit discussion. The language designer should strive to avoid situations where a minor change to a program produces a massive change in efficiency. Such gross unpredictability will understandably upset users. The speed and space requirements of a construct should be roughly reflected in its size and complexity as the programmer sees it—more expensive operations should require more writing. To some extent, this is another application of the "no surprises" rule mentioned earlier.

Accepting that inefficiency can be a problem, what are the potential cures for it? The one that is invariably presented is extensive optimization in the compiler (see Chaps. 12 and 13). Optimizing compilers can be very helpful, particularly where the scarce resource is space. In addition, specific features can also help (e.g., a GOTO-less language will require much less compiler effort for analysis of the control flow paths).

The optimizing compiler, however, cannot be presented as the whole answer. A good optimizing compiler is difficult and time-consuming to write. Because of larger size and greater complexity, it will probably be less reliable than a nonoptimizing compiler. Great care must be taken that the semantics of the language are not changed by optimization. Finally, optimizers generally run much more slowly than simpler compilers, which can be a surprisingly significant factor even at installations which think they spend most of their time on production runs.

One much-neglected way of easing the efficiency problem is that of giving optimization advice to the compiler. Although it is not always clear exactly what form much of this advice should take, a few significant ideas are available which can make a great deal of difference. An order-of-magnitude speed improvement can result from being able to advise the compiler that certain variables will be heavily used and should receive special attention regarding access time (such attention might take the form of placing them in registers rather than main storage). Sizable space improvements can result from being able to advise the compiler that a large data structure should be packed as tightly as possible regardless of access time.
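
C's register storage class is a concrete form of exactly this kind of advice; the compiler remains free to honor or ignore the hint. A minimal sketch:

    /* heavily used variables flagged for fast access; the compiler may
       honor or ignore the hint */
    long sum(const int *a, int n) {
        register int i;
        register long total = 0;
        for (i = 0; i < n; i++)
            total += a[i];
        return total;
    }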

Finally, significant improvements in efficiency can be obtained by making the language simple enough that it is easy to generate reasonably efficient code. PASCAL (Jensen and Wirth, 1975) is a good example of this. PASCAL omits most of the less efficient constructs, and the constructs it does have are simple enough and few enough that a modest amount of effort can produce highly efficient code.

Machine Independence

Although one of the original hopes for high-level languages was that they would be machine-independent, this hope has not been fully realized. Most current languages remain strongly machine-dependent, and the transportability of programs suffers as a consequence.

Before machine independence is discussed, the definition of machine-independent must be clarified. A workable definition is as follows: a language is machine-independent if and only if a program which has been compiled and run correctly on machine X will, when transported to machine Y, run with exactly the same output given the same input.

Note that the fact that a language is not machine-independent does not necessarily rule out transportation of programs. Often a subset of the language will be machine-independent (for example, a subset not including floating-point arithmetic). Even in a language that is very machine-dependent (e.g., one that includes floating-point arithmetic or a "word" data type), it is often possible to avoid machine dependencies sufficiently to make specific programs transportable.

Most machine dependencies result from specifying parameters in terms of a particular machine rather than in terms of the actual dimensions involved. The best example of this is the specification of integer variables. Most languages just say "integer," and there is no indication of what sort of precision is needed. COBOL and PL/I specify the number of digits required. The simplest approach is to specify the smallest and largest numbers that will be contained in a particular variable, as is done in PASCAL.
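
For comparison, PASCAL writes such a range directly (var count: 0..4095). In a C sketch, assuming the exact-width types of <stdint.h>, the mapping from range to representation must instead be done by hand:

    #include <stdint.h>
    #include <assert.h>

    /* PASCAL:  var count: 0..4095;
       here the programmer maps the stated range onto a width and checks
       it at run time (names are illustrative) */
    typedef uint16_t count_t;          /* 0..4095 fits in 16 bits */

    void set_count(count_t *c, long v) {
        assert(v >= 0 && v <= 4095);   /* the declared range */
        *c = (count_t)v;
    }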

PASCAL, in fact, is certainly worth studying as an example of how to make a language nearly machine-independent without pain. Other significant languages that come close to PASCAL in machine independence are FORTRAN and COBOL; however, their machine independence has come about as a result of efforts in standardization rather than through original language design.

One area of extreme machine dependence is floating-point arithmetic. The fundamental cause of this is that there is, as yet, no sufficiently general way to define the precision of numbers and operations in a manner independent of the implementation. However, the IEEE has proposed a standard for floating-point arithmetic, and it is to be hoped that this standard will be incorporated in future language-design efforts.

Similar troubles arise in connection with the character set. The exact bit patterns representing a given character are not usually relevant, but the collating sequence of the characters varies enough from machine to machine to make machine independence very difficult. It is usually safe to assume that

1. Digits are in order.

2. The uppercase alphabet is in order.

3. The lowercase alphabet is in order.

4. The blank precedes any printable character.

Note that "in order" means ‘a’ < ‘b1 but does not forbid the existence of x such that ‘a’ < x < ‘b’.

Nonprinting "control characters" also cause problems, especially for I/O. One aspect that is quite difficult to make machine-independent is the question of how the end of a line is marked in character I/O. Some machines use a " newline" character, while others rely on a fixed length or a length count. An approach that is often adopted is to include line-feed control as part of the I/O primitives even if the machine uses a "newline" character. Other I/O questions, such as direct-access files, are often so machine-dependent that any language incorporating special constructs for them is destined to sizable machine dependencies.

It should be pointed out that machine independence is not required for all languages. In particular, languages intended for writing operating systems or other specialized machine-dependent software will inevitably have a significant degree of machine dependence. For some application-oriented languages (e.g., real-time control languages), it is questionable whether it is necessary to remove all machine dependencies.

Simplicity

Simplicity is a major question in the design of any language. It is clear that most users prefer a simple language. In fact, a language which fails to be simple will generally fail in many other respects, as attested to by the lack of overwhelming support for PL/I and ALGOL 68. While the lack of simplicity can result in an unaccepted language, simplicity by itself does not imply a good language. BASIC is certainly a simple language, but it is not particularly well designed and most versions should be avoided, if possible, for any type of serious software development.

On the basis of the history of programming languages, one or two statements can be made about the best way of achieving simplicity (Wirth, 1974). Simplicity is not achieved through lack of structure; the result of this is chaos. Simplicity is not achieved through limitless generality; the result of this is a language that is impossible either to truly master or to completely implement.

Simplicity is best achieved through restriction of objectives, great care as to readability and obviousness, and the basing of the language around a few well-defined, simple concepts.

Uniformity

Uniformity is defined as "doing the same thing the same way regardless of context." If adopted as a language-design principle, it can help in reducing the number of things programmers have to think about at one time, since they do not need to consider context to know how a feature will act.

Like several other aspects which have been considered before, uniformity can be useful or useless depending on how it is applied. An example of useful uniformity is the ALGOL notion of the expression: anywhere an arithmetic value is needed, any expression may be used. In general, useless uniformity is having a construct behave the same way in two contexts that are sufficiently different that the programmer does not expect identical behavior and has no use for it. A possible example of useless uniformity is the merging of the ideas of "expression" and "statement" in ALGOL 68. With the single and somewhat debatable exception of using an IF in an expression, this merging is of questionable use in most programs.

A subtopic of "uniformity" that deserves consideration is the matter of eliminating special cases. Doing something one way most of the time and then another way in one particular case (possibly for implementation reasons) should be avoided. A good example of this is the method used to assign a value to a label variable in PL/I—it is completely different from the method used in assigning a value to an element of a label array. Such special cases have three specific problems:

1. They complicate the language.

2. They lead to implementation differences, since some people will implement them faithfully and others will "improve" the obviously bad design by eliminating them.

3. They tend to become "hardened" and cannot be removed even after they have outlived their usefulness, since this would invalidate existing implementations.

Point 3 is illustrated by the fact that we are still living with the IBM 709, as hardened into FORTRAN.

Orthogonality

Orthogonality is another design philosophy, prominent largely because of its extensive use in ALGOL 68. The basic idea of orthogonality is to have several general concepts, each of which functions on its own without knowledge of the inner structure of the others. Good examples of this from ALGOL 68 are value accessing (e.g., simple variables, array accesses) and arithmetic operators (e.g., "+"). The value-accessing primitives yield a value; they do not care how it is used. The arithmetic operators combine two values; they do not care how the values are obtained.

Although orthogonality can result in considerable improvements in simplicity, the example of ALGOL 68 provides a warning: it is doubtful if orthogonality is useful as a substitute for simplicity. The additional fact that highly general orthogonality seems to be harder to achieve than simplicity supports the arguments against concentrating on orthogonality to the exclusion of all else.

The alternative to orthogonality (which has sometimes been characterized as "provide all features everywhere") is sometimes called "diagonality" ("provide features where needed or useful"). For example, the orthogonality of value access and arithmetic, mentioned above, is unquestionably valuable. However, it is not so clear that it is valuable to extend this orthogonality to not caring whether the values are simple values or arrays. The difficulty of a straightforward extension to arrays is that expressions applying elementwise operators to whole arrays (for instance, an assignment between overlapping sections of the same array) can have implementation-dependent results. Here is a case where a bit less orthogonality (i.e., a bit more "diagonality") seems called for. The operator structure that works so well for simple scalar operands simply does not extend cleanly to arrays. Note that this does not rule out all array operators; assignment and comparison are still useful and straightforward.

Generalization and Specialization

The basic philosophy of generality is "if we allow this, then let's allow everything similar as well." In a manner similar to orthogonality, this notion is an aid to simplicity but not a substitute for simplicity. If carried too far, generality can result in rarely used, error-prone features which can be difficult to implement. The implementation of strings in ALGOL 68 provides an example. A proper implementation of strings allows variable, unbounded length. The designers of ALGOL 68 decided to generalize this; ALGOL 68 allows the bounds of "flexible" arrays to change at any time and makes strings just a special case of this. As it turns out, flexible arrays are an implementer's nightmare and are of marginal use except to provide strings. An alternative approach is simply to define "string" as a primitive, as some ALGOL 68 subset implementers have done.

One other special case of generality versus specialization is worth mentioning. Many languages have built-in functions which take any number of arguments, whereas it is very seldom that user functions can do this. Only I/O functions and a few special functions like MAX and MIN really need variable numbers of arguments. In most instances, users do not need variable-length argument lists for their own procedures if the language already provides these special cases.
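
C is one of the exceptions, and a sketch of a user-written MAX suggests how rarely the facility is really needed; one such routine, written once, serves every caller:

    #include <stdarg.h>

    /* a user-written MAX over n integer arguments (n must be >= 1) */
    int max_of(int n, ...) {
        va_list ap;
        int i, best, v;
        va_start(ap, n);
        best = va_arg(ap, int);
        for (i = 1; i < n; i++) {
            v = va_arg(ap, int);
            if (v > best)
                best = v;
        }
        va_end(ap);
        return best;
    }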

Other Design Philosophies

Language modularity is the basic philosophy advanced in PL/I; the idea is that users learn a subset appropriate to their own needs so that no one needs to memorize the entire (immense) language. This notion fails as a substitute for simplicity in that it works well only if programmers never make a mistake. When they do make a mistake, they may well have to understand the entire language to decide what happened and why, because the mistake may very well cause activation of features unknown to them.

On a somewhat smaller scale, the use of defaults often achieves a kind of modularity in which users are effectively working in a sublanguage which is better suited to their needs. This is often a useful feature, provided that users can understand the structure of the defaults without having to understand the whole language. This normally requires that defaults be used only in cases where they are absolutely self-evident.

Minimality is the notion that the language should contain the absolute minimum number of constructs with which it can possibly survive. This is generally a good philosophy, but it fails if carried too far. A Turing machine is very minimal, but you would not want to program one. It is important to distinguish between a minimal set of constructs, which is the smallest possible set, and a minimal usable set, which is the smallest set that still satisfies the basic requirement of usability reasonably well.

An example of where the "minimal set of constructs" doctrine is applied most frequently is in connection with control structures, usually with reference to a set which does not contain the GOTO. It can be shown that sequential flow and the WHILE loop are sufficient to generate all other constructs (even the IF), but such a minimal set is far from usable.
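
To make the sufficiency claim concrete, an IF can be simulated by a WHILE whose body runs at most once, as in this C sketch; the contortion itself shows why the minimal set is far from usable.

    /* IF cond THEN body, written with sequencing and WHILE alone */
    void if_via_while(int cond) {
        int once = 1;
        while (cond && once) {
            /* ...the "then" part goes here... */
            once = 0;            /* guarantees at most one iteration */
        }
    }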

Programming-language designers should generally attempt to have the constructs of their language come from more or less the same level of abstraction. The brief discussion to follow illustrates this idea under two headings: avoiding constructs which are too primitive to fit well into the language, and avoiding constructs which are too high-level to fit well.

A language which is reasonably well removed from the liabilities of the hardware should not contain one or two constructs that bring them back again. For example, to avoid reintroducing the hardware "branch" (i.e., the GOTO), provide a reasonably good set of high-level control primitives. To avoid reintroducing that troublesome construct, the unrestricted pointer, either restrict it (for example, to point only to objects of a specific type) or do away with it completely by implementing something like Hoare's recursive data types (Hoare, Dahl, and Dijkstra, 1972). To avoid reintroducing the "word" in a situation where a word contains, say, a 12-bit count and then 4 flag bits, arrange the storage-allocation strategy of your compiler so that a declaration of the count and the flags as separate, properly typed fields of a single packed structure results in the desired storage assignment while full type checking is retained.
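
C's bit-fields illustrate one way a compiler can provide this: the fields are declared with their own widths and the compiler does the packing. A minimal sketch, with illustrative field names:

    /* a 12-bit count and four flag bits declared as typed fields; the
       compiler packs them into a word, and accesses such as w.count
       remain fully type-checked, with no masking or shifting in the
       source program */
    struct device_word {
        unsigned int count : 12;
        unsigned int ready : 1, error : 1, busy : 1, done : 1;
    };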

It is also generally undesirable to have constructs from a higher level of abstraction included. This is not so much because they clutter up the language as because it is hard to introduce a reasonably general set without defining essentially an entire new language. The users will want to know why, when you included a primitive that sorts a file, you could not include a similar primitive to sort an array. Under special circumstances, it may be necessary to include a few such constructs. For example, even in a language that supports strings only as fixed-length vectors of characters, it is useful to have literal string constants such as 'a string' available as constants of type "array of character." Generally, however, such constructs only make the users more aware that they do not have an entire language at the higher level of abstraction.

DETAILED DESIGN

This section discusses, on various levels and from several viewpoints, some of the details which go into a language. Some areas of language design have been investigated well enough that it is possible to lay down fairly concrete guidelines; this is done whenever possible. However, many topics are not yet at such a stage, and an attempt is made only to give general discussion on such matters.

The language designer reading this section should recognize that most of this section presents consensus opinions rather than hard facts. Designers should, by all means, feel free to follow different advice or to strike out on their own (see, however, Sec. 3-3, and Hoare, 1973, on the dangers of too much inventiveness). This section is intended to provide a baseline for those who intend to strike out, and a foundation for those who just need to put a language together and do not wish to innovate extensively.

The section is organized on the basis of the various types of structures a program exhibits, and emphasis is placed on how these structures are influenced by the programming language in use. In this context the implications for language design are discussed. The structures are ordered essentially in a bottom-up sequence, working from lowest level to highest level.

Microstructure

What is here called microstructure basically covers those issues of language design which affect the appearance of the language but do not change its semantics. The most significant principle of language microstructure is that the meaning of a construct, such as an operator, should be obvious from its appearance. In other words, the tokens of the language should be easily recognized for what they are and what they do.

The most conspicuous and lowest-level aspect of microstructure is the character set used. Too large a character set can result in users puzzling over the meanings of the obscure hieroglyphics in a program. Too small a character set often results in rather strained syntax, as witness the overuse of parentheses in PL/I. The character set should be as standard as possible to avoid changing programs or the language itself when moving between machines.

In general, the best character set appears to be the 7-bit ASCII set. This is the English-language version of the ISO International Standard Character Set, and is the de facto standard for most of the industry. It is missing perhaps half a dozen useful characters, but in general it is suitable for most programming languages. The 8-bit EBCDIC character set generally has about the same assortment of available characters but suffers from the existence of several versions and heavy dependence on specific manufacturers.

It is likely that the 48-character set, which was used in earlier years of the computing industry, will be phased out. It is clearly difficult to attempt to implement the programming languages of today on this small character set.

To be readable to a compiler, a program must be entered into the machine in some manner. The punched card is no longer a heavily used input medium. The current trend is toward on-line entry, either onto magnetic storage media for later transfer to the computer or, increasingly, directly into the computer via time-sharing terminals or workstations. Quite often the programmer does the typing if time sharing or workstations are used; most other methods employ data-entry specialists. Whatever system is in use, programs should be easy to type.

Some general principles can be set down for easy typing. When using multicharacter nonalphanumeric symbols, try to avoid combinations which require the typist to press the shift key for one character and release it for the next; such sequences slow down typing dramatically. The best multicharacter symbols are those which repeat the same character twice, rather than using two different characters.

Keyboards differ, and hence if an installation has several different terminal types, they quite possibly all have different keyboards. This is significant if programmers do their own typing, since it is hard to become familiar with one keyboard if you have to take whichever terminal is free. The major result of this is that even programmers who can touch-type have to look at the keys for characters other than the letters, the digits, and the following special symbols: ,./?(). Some of the other character placements are nearly standard, but that is not good enough. Try to use keywords or combinations of the above characters for frequently used symbols.

Other things being equal, short symbols are easier to type than long ones (such as keywords). This is not, however, a valid reason for using one obscure character when a well-chosen keyword would be much clearer.

Another factor which affects the general choice of symbols is that the length of a given symbol should reflect certain semantic characteristics. For operator symbols, lower-priority operators should have longer symbols (it is surprising how much difference this makes to readability).

It is generally a good idea to use a symbol for only one purpose, especially if it is an operator. It is regrettable, from the viewpoint of parsing and error checking, that most programming languages carry over from mathematics the convention of using "-" for both subtraction and unary negation. If the designer feels like experimenting, perhaps "NEG" could be used for negation.

When a suitable combination of nonalphanumeric characters is not self-evident, a keyword must be chosen for a symbol. Certain general principles can be laid down for the choice of such keywords.

Keywords should be pronounceable. It is easier to remember a syllable than a sequence of unrelated, unpronounceable letters.

Keywords generally should be chosen so that they are unlikely to duplicate user-defined identifiers. Even if there is some means of distinguishing the two, the potential for confusion is high. When choosing keywords, try to use verbs and prepositions; programmers tend to use nouns and adjectives as variable names.

Keywords should be spelled the way the user would expect them. If you must use "SIZE" as an operator, do not spell it "SIZ." Such peculiar spellings will cause more trouble than they save. If at all possible, do not have two keywords with very similar spellings (for example, "PROGEND" and "PROCEND").

Abbreviate keywords only when the abbreviation is absolutely self-evident. This saves considerable typing and is frequently used. Abbreviating "PROCEDURE" to "PROC" or "PREDECESSOR" to "PRED" is certainly justified; abbreviating "EXTERNAL" to "EX" or "REAL" to "RE" is not. Abbreviating "INTEGER" to "INT" is probably justified.

Sometimes it will be necessary to have a pair of keywords bracketing something. A popular practice in recent years, notably in ALGOL 68 (Lindsey and van der Meulen, 1972; Van Wijngaarden et al., 1975), has been to reverse the spelling of the leading keyword to get the trailing keyword. This works some of the time, but many of the results are horrible. It is probably best to use only the following pairs, which are gaining wide acceptance:

IF … FI
CASE … ESAC
DO … OD

If you have to end a construct which starts with "SELECT," use "ENDSELECT," not "TCELES." Another alternative, which is harder to design, is to try to find a keyword which "fits" as a trailing keyword (example: "LOOP… REPEAT"). It is also possible to use a pair of keywords (example: "LOOP… ENDLOOP") instead of a single keyword, if no serious confusion is introduced elsewhere. Also, try to avoid overusing a keyword; the PL/I "END" is a gross violator of this.

One further matter that is important with respect to keywords is how to tell them apart from user-defined identifiers. There are three distinct approaches:

1. Keywords are "reserved" ancl may not be used as identifiers.

2. Keywords are distinguished from identifiers on the basis of context.

3. Keywords are preceded by some special character to mark them.

Generally, alternative 1 works best—it is simple and, given a careful choice of keywords, seldom causes trouble. Alternative 2, the one used in PL/I, significantly complicates the parser. It probably was adopted in PL/I because the number of keywords in some implementations was so large that no user could reasonably be expected to avoid them all. Languages of a more realistic size do not have such problems. Alternative 3 has often been used in ALGOL 60 and ALGOL 68 implementations; it involves extra typing and makes the program unreadable. The most successful single implementation of ALGOL 60 (the Burroughs one) uses alternative 1.
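
Alternative 1 is also trivial to implement: the scanner simply checks each identifier against a fixed table, as in this C sketch (the table contents are illustrative):

    #include <string.h>

    /* every identifier is checked against the reserved-word table */
    static const char *reserved[] = {
        "IF", "FI", "CASE", "ESAC", "DO", "OD", "WHILE"
    };

    int is_reserved(const char *ident) {
        size_t i;
        for (i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
            if (strcmp(ident, reserved[i]) == 0)
                return 1;        /* a keyword */
        return 0;                /* a user-defined identifier */
    }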

User-defined identifiers deserve some discussion. The usual basic set is letters plus digits, except that the first character must be a letter. If both uppercase and lowercase letters are available, they should be considered equivalent. There are few things more confusing than having "A" and "a" acting as separate variables. It has become customary to add a few extra characters to the list of "letters," for convenience. Probably the most important one is the underscore "_", which allows, in effect, the use of "blanks" inside identifiers. This provides a major improvement in readability.

With modern string-handling techniques there is no justification for limiting the length of identifiers, either explicitly ("maximum 6 characters") or implicitly ("only the first 6 are significant"). This approach does not require unlimited storage, as few identifiers will be longer than about 15 characters (Alexander, 1972). A "silent limit" of, say, 255 characters will never be exceeded or even approached, especially if identifiers cannot be split across line boundaries.

Another aspect of microstructure which deserves careful attention is the design of the comment convention. It must be brief (not "COMMENT"), and it must be easy to type (not "/*"). As Hoare points out (Hoare, 1973), probably the best type of comment convention is one that is fairly common in assemblers but uncommon in high-level languages: a particular symbol starts a comment, which then extends to the end of that line. This eliminates the "runaway" comments of PL/I, and this convention turns out to be surprisingly convenient to use. A significant question is the choice of a beginning symbol. While various symbols have been used for such comments, the requirement of easy typing may eliminate many of them. Also, such one-character symbols are often useful as operators. Ideally, such a symbol should:

1. Be a two-character symbol, preferably both the same character.

2. Be a symbol seldom, if ever, used as an operator, with no obvious meaning as such.

3. Be composed of characters located in the same place on all keyboards.

The symbol "//" is an example of a good choice.

On the matter of program appearance, most languages have followed ALGOL's lead in declaring that the blank is significant only in that (except within strings) it separates two symbols and may not occur within a symbol. If the system has a "tab" character, it is useful to treat it the same way. It is also reasonable to treat end-of-line the same way, except that it also may terminate comments. It is most undesirable to require constructs to start in certain columns.

One final matter concerning the appearance of the language: unless it is planned to use an automatic source-text formatter as part of the compiler, programmers should be left free to lay out their source text readably. If the system has a form-feed character available, for example, it is a simple matter to arrange for the compiler's scanner to ignore such characters, so that programmers can use them to paginate their source listings.

Expression Structure

The expression is often the fundamental unit of computation in a language. Its components, operators and value accesses, are too peculiar to specific languages to be given much treatment here, but some comments can be made on expressions in general.

One topic related to expressions is the question of order of evaluation. The normal method of determining order of evaluation is based on two levels: explicit bracketing and operator binding.

Explicit bracketing comprises both parentheses and the overall bracketing provided by the boundaries of the expression. In this connection, it is worth noting an idea a few languages (such as LISP) have: having two types of parenthesis constructs, "[ ]" and "( )", so that complex expressions become more readable.

Operator binding is the most familiar aspect of order of evaluation. Three basic systems of binding exist: left-to-right, right-to-left, and priority. Left-to-right cannot be recommended; the only reason it is used at all (quite frequently in assemblers) is that it is simple. Right-to-left is notable only because it is used in APL. It is used mainly because of the wide variety of operators, which makes it difficult to try to assign precedences to them all. An alternative approach for a language with lots of operators is to assign priorities to some of the simpler operators and require explicit bracketing of the rest. Experiments in language design (Gannon, 1975) have indicated that APL’s "natural" right-to-left rule is in fact harder to understand than simple priority rules.

One thing that must be avoided is large numbers of priority levels, as in C, which can confuse rather than illuminate. Restriction to a small number of priority levels can best be done by considering which operators are, psychologically speaking, on different levels.

The desire to reduce the number of levels should not lead the designer into placing, on the same level, operators which are psychologically on different levels. For example, the decision in PASCAL to put "AND" on the same level as "*" and "OR" on the same level as "+" may have been a mistake: simple multiple comparisons such as "a = b AND c = d" are illegal; this must be written "(a = b) AND (c = d)".
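
A small table in the spirit of this advice might place the comparisons above "AND", so that "a = b AND c = d" parses as the programmer expects. A C sketch, with illustrative levels and operators:

    #include <string.h>

    /* a small, psychologically grouped priority table */
    int priority(const char *op) {
        if (strcmp(op, "OR") == 0)                         return 1; /* binds last */
        if (strcmp(op, "AND") == 0)                        return 2;
        if (strcmp(op, "=") == 0  || strcmp(op, "<") == 0) return 3; /* comparisons */
        if (strcmp(op, "+") == 0  || strcmp(op, "-") == 0) return 4;
        if (strcmp(op, "*") == 0  || strcmp(op, "/") == 0) return 5; /* binds first */
        return 0;   /* unknown operator: require explicit brackets */
    }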

A minor aspect of operator binding is the relationship of unary and binary operators. In many languages this is complicated considerably by the fact that not all unary operators have the same priority; NOT is often given a very low priority. While there is some justification for this (to put the logical operators all in the same range of priority), it is probably simpler to take the approach of ALGOL 68 (Van Wijngaarden et al., 1975): all unary operators have a priority higher than all binary operators.

A matter related to expression structure is the much-praised notion of the "expression language," the idea that every construct returns a value and may be used anywhere a value is required. At first sight, the concept looks appealing. The whole idea of the "statement" vanishes. It becomes possible to use an IF or CASE to select which of several values should be used at a particular point in an expression. Unfortunately, a deeper examination shows that the loss of "statement" is more than counterbalanced by the additional complexity in the behavior of "expression." Furthermore, all but the simplest cases of using a complex construct inside, say, an arithmetic expression become unreadable. It is also very unclear just what value should be returned from, for example, a loop. If users are allowed to supply their own values, this makes things worse. The expression language is a novel idea, but in reality it appears to increase complexity, widen the field for trickery and cleverness at the expense of readability, and add no particularly useful capabilities.

The one possible exception to this is the conditional expression: a simple IF-expression which chooses one of two values. This is reasonably easy to understand and is sometimes useful; arguments against it are the infrequency of use and the difficulty of arriving at a concise yet readable syntax.
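
C's "?:" operator is one existing concise syntax for the conditional expression:

    /* the conditional expression: an IF-expression yielding one of two values */
    int max2(int a, int b) {
        return (a > b) ? a : b;
    }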

The alternative to an expression language is the statement language, where a distinction is made between statements and expressions. Control constructs are statements and are not valid expressions. Expressions yield a value and generally, barring the misuse of function calls, do not have side effects. This type of organization corresponds well to the way programmers normally think about their programs: a "statement" performs an action, an "expression" yields a value.
