Scanners (Compiler Writing) Part 2


In Sec. 2-3, the grammar classification of Chomsky was introduced. In this section, one class of these grammars, the 7"3 or regular grammars, is discussed. Regular expressions are also described in this section, and an algorithm for converting a regular grammar to a regular expression is given. Both formalisms can be used to describe tokens.

Recall from topic 2 that a regular grammar is defined to consist only of productions of the formtmp4C8-50_thumb[2]has the form aB or a wheretmp4C8-51_thumb[2]Furthermore, the languages generated by such grammars are said to be regular. Examples of regular grammars and regular languages were presented in Chap. 2.

A more compact way of representing regular languages is with the use of regular expressions. The equivalence of regular grammars and regular expressions will be demonstrated in the next two sections when proofs of the equivalence of regular grammars and finite-state acceptors and of regular expressions and finite-state acceptors are given.

Regular expressions make use of three operators: concatenation, alternation, and closure. Assume that the two expressions ei and e2 generate the languages and L2, respectively. Concatenation is then defined astmp4C8-52_thumb[2]and tmp4C8-53_thumb[2]Alternation, which is denoted by either | or +, is the union of the languages denoted by two expressions. Hence,tmp4C8-54_thumb[2]

Closure, which is represented by the braces { }, denotes the repetition of the expression zero or more times. Thus,tmp4C8-55_thumb[2]

These operators are analogous to the BNF operators introduced in Chap. 2. For example, the expression 110 consists of the digits 1, 1, and 0 concatenated together and denotes the language L = {110}. The expression 0|1 denotes the language L = {0,1} while the expression {1} denotes the language L = |1′| / = 0,1,2,…}.

Other examples of regular expressions that specify familiar sets of tokens follow.


Note that the terms letter and digit represent the sets of alphabetic letters and decimal digits, respectively.

Using the definitions of the above operators, a formal definition of regular expressions is now given.

Definition 4-1 Regular expressions are those expressions that can be constructed from the following rules:

1. $ is a regular expression denoting the empty set.

2. e is a regular expression denoting the language consisting of only the empty string, that is, {e}.

3. a, wheretmp4C8-63_thumb[2]is a regular expression denoting the language consisting of the single symbol a, that is, the language {a}.

4. If el and e2 are regular expressions denoting the languages Lx and L2, respectively, then:

(a)tmp4C8-64_thumb[2]is a regular expression denotingtmp4C8-65_thumb[2]

(b)tmp4C8-66_thumb[2]is a regular expression denotingtmp4C8-67_thumb[2] .(c)tmp4C8-68_thumb[2]is a regular expression denotingtmp4C8-69_thumb[2]

If the precedence of the operators in regular expressions are defined with closure ({ }) having the highest precedence, concatenation with the next highest, and alternation (|) having the lowest precedence, the parentheses can be eliminated whenever possible. Thus, ((/?)!((/>)(?))) would have the same meaning if the parentheses were omitted. In the case of {p\q)r, the parentheses cannot be removed, as the language denoted by the expression would become different.

Some examples of regular expressions are now given. The expression {a} {b} represents the languagetmp4C8-70_thumb[2]This expression can be compared with the expression {ab} which generates the languagetmp4C8-71_thumb[2]and the expression {a\b} denoting the languagetmp4C8-72_thumb[2]The expression {aa\ab\ba\bb} denotes all strings overtmp4C8-73_thumb[2]of even length. This can be seen by observing that the expression aa\ab\ba\bb denotes all strings of length two with zero or more combinations of the strings generated by this expression producing strings all of an even length.

It is convenient to define the equality of regular expressions. Two regular expressions are equal (=) or equivalent if they denote the same language. Thus, 0(0} = 00(0} 10 as each expression generates the languagetmp4C8-74_thumb[2]

The tokens of a programming language can be defined in terms of regular grammars and regular expressions. For example, an identifier which consists of letters and digits and must begin with a letter can be described by the regular grammar


and by the regular expression


Using letter as a shorthand representation for a \ b \ c| … |z and digit for 0111 … 19, the expression might be written as letter{letter| digit} to improve readability. As we shall see in Sec. 4-7, regular expressions are often used to describe tokens for a scanner generator.

For the remainder of this section, a method for converting regular grammars to regular expressions is presented. At this point, however, it should be noted that the regular-expression operators obey some algebraic rules. For example, both alternation and concatenation are associative, with alternation also being commutative. Thus we can write (ab)c = a(bc), (a\b)\c = a\(b\c), and a\b = b\a. Distributivity also applies, with concatenation distributing over alternation; that is, a{b\c) = ab\ac. Proofs of these identities are left as exercises. Finally, regular expressions can be used in equations and therefore evaluated. For example, A = {aA} is a valid regular-expression equation.

To present a method of converting a regular grammar to a regular expression, consider the following regular grammar from Chap. 2 which generates the languagetmp4C8-89_thumb[2]


Replacing the production operator (-ยป) with an equal sign and combining all possible productions from a given nonterminal into one expression by using the alternation operator, the grammar can be written as the set of equations


By solving this set of equations, a regular expression with only terminal symbols is derived which generates the same language as the original regular grammar. Solutions for equations which are defined only in terms of themselves should be done first. In this example, the equation for C has a solution of C = {a}a. This can be shown by substituting this solution into the equation for C, giving


Factoring yields


Now,tmp4C8-95_thumb[2], since e is in both sides of the equation, a is in both sides, and any string a’ is in both sides.

The solution for C can be substituted into the second equation, and the third equation can be dropped as it is no longer needed. This results in the set of equations


We immediately have a solution for B which can be substituted into the first equation. This results in the equation


It can easily be shown that a solution for S is S = {a}ab{a}a, which is a regular expression generating the same language as the original regular grammar.

An algorithm which generates an equivalent regular expression from a regular grammar is now given.

Algorithm REGULAFLEXPRESSION. Given a regular grammar with the set of productions of the formtmp4C8-100_thumb[2]is of the form aXj or a wheretmp4C8-101_thumb[2] this algorithm generates an equivalent regular expression. Xl is the starting symbol of the grammar and there are n terminal symbols in the grammar.

1. [Convert the regular expression equations]

Repeat for each productiontmp4C8-102_thumb[2]of the regular grammar

If the equation X, has not been initialized then Definetmp4C8-103_thumb[2] else Changetmp4C8-104_thumb[2]where a is the previously defined part of the equation X,

2. [Convert equations to required format]

Repeat for /’ = 1,2,…,n – 1

Convert equation X, into the formtmp4C8-105_thumb[2]where is of the formtmp4C8-106_thumb[2]and a, and eachtmp4C8-107_thumb[2]is a regular expression overtmp4C8-108_thumb[2] Repeat fortmp4C8-109_thumb[2]


3. [Solve for the solution]

Repeat for /’ = n,n – 1…..1

Convert the equation for X, into the formtmp4C8-111_thumb[2]wheretmp4C8-112_thumb[2]is a regular expression overtmp4C8-113_thumb[2] Repeat for j = i – 1, i – 2,…,1 Substitute the solutiontmp4C8-114_thumb[2]into the equation for Xt


Step 2 then converts each equation to the form


While some algebraic manipulation may be required, it is possible for each equation to be converted to this form. Step 3 uses back substitution to solve the set of equations. The solution for Xn is easily found, as this equation has the form tmp4C8-134_thumb[2]and it istmp4C8-135_thumb[2]This solution is then substituted into each preceding equation. Once a solution for equation X, has been found and substituted into the preceding equations, the equation fortmp4C8-136_thumb[2]has only the unknowntmp4C8-137_thumb[2]in it, and the equation can be written in the formtmp4C8-138_thumb[2]

Thus a solution fortmp4C8-139_thumb[2]is easily computed. At the completion of step 3, the regular expression for X1 is the regular expression equivalent to the regular grammar which the algorithm started with. A trace of this algorithm is left for the exercises.

In the next section, finite-state acceptors are introduced. Finite-state acceptors are theoretical machines which can be used to determine whether a given string can be generated from a regular grammar or a regular expression. In closing, a scanner generator based on regular expressions is discussed in Sec. 4-7.


In this section we continue the discussion of finite-state acceptors introduced in Sec. 4-2. Scanner generators often create a scanner which simulates a finite-state acceptor, as it is not difficult to program an acceptor on a computer. Three types of finite-state acceptors are examined: deterministic, nondeterministic, and non-deterministic with e-transitions. The equivalence of these types of acceptors is also demonstrated.

Deterministic Finite-State Acceptors

A deterministic finite-state acceptor (DFA), which is also known as a deterministic finite automaton, is an acceptor which for any state and input character has at most one transition state that the acceptor changes to. If no transition state is specified, the input string is rejected.

The following definition formally introduces a DFA.

Definition 4-2 A deterministic finite-state acceptor (DFA) is a 5-tuple (K, VT, M, S, Z) where:

K is a finite, nonempty set of elements called states. VT is an alphabet called the input alphabet. M is a mapping from K X VT into K.

tmp4C8-149_thumb[2]is called the initial state or starting state. tmp4C8-150_thumb[2]is a nonempty set of final states.

The DFA is initially in state 5 and reads an input string from left to right. The mapping function M of the DFA defines the state transitions and is denoted by M(Q, T) = R, where Q and R are states of K and T is a character from the input alphabet. The mapping function indicates that when the acceptor is in state Q and T is the next input character, the acceptor changes to state R. The following two definitions complete the specification of the mapping function:


The first definition implies that a DFA cannot change state without reading a character from .the input alphabet, while the second definition is a recursive definition showing that when the DFA is in state Q with some input string x = Tt, the mapping M(Q, T) is applied first with the result P = M(Q, T). Then the mapping M(P, t) can be applied. This definition extends the applicability of the mapping function to strings over rather than just elements of VT.

A sentence t is said to be accepted by a DFA if M(S, t) = P for some DFA F = (K, VT, M,S,Z) such thattmp4C8-154_thumb[2]andtmp4C8-155_thumb[2]The string t is accepted by the DFA if, after reading the entire string, the DFA terminates in a final state. The set of all t e V* which is accepted by the DFA F is specified by L{F). Thus L(F) is defined as

An example of a deterministic finite-state acceptor.

Figure 4-5 An example of a deterministic finite-state acceptor.

The following DFA is an example of an acceptor which accepts strings consisting only of an even number of Os and an even number of Is.


where M is defined as


The transition diagram for this acceptor is given in Fig. 4-5.

The mapping function for an acceptor is sometimes more conveniently illustrated with a transition table such as Table 4-1, which defines M in the previous example.

An example of a string which is accepted by this DFA is 110101, and a string not accepted is 11101. Traces of the DFA operating on these strings are given in Table 4-2.

In this subsection, deterministic finite-state acceptors were discussed. The next subsection examines finite-state acceptors which may have more than one transition for a given state and input character. Such an acceptor is shown to be no more powerful than its deterministic counterpart.

Next post:

Previous post: