Scanners (Compiler Writing) Part 4

EQUIVALENCE OF REGULAR GRAMMARS AND FINITE-STATE ACCEPTORS

The equivalence of regular grammars and finite-state acceptors is shown in this section. First, a method for constructing an NFA from a regular grammar is given. Then a way of converting a DFA to a regular grammar is illustrated, completing the proof that the languages accepted by finite-state acceptors are the same as the languages which can be produced by regular grammars.

The following theorem and proof provide the procedure for converting a regular grammar to a nondeterministic finite-state acceptor.

Theorem 4-3 There exists a nondeterministic finite-state acceptor F = (K, VT, M, S, Z) which accepts the language generated by the regular grammartmp4C8-337_thumb[2][2][2]

Proof Define the NFA F with the states of F beingtmp4C8-338_thumb[2][2][2]where


tmp4C8-339_thumb[2][2][2]The initial state of the acceptor is £ (the start symbol of the grammar), and its final state is X. For each production of the grammar, construct the mapping M from $ in the following manner:

1.tmp4C8-343_thumb[2][2][2]if there is a productiontmp4C8-344_thumb[2][2][2]

2.tmp4C8-345_thumb[2][2][2]if there is a productiontmp4C8-346_thumb[2][2][2]

The acceptor F, when processing sentence x, simulates a derivation of x in the grammar G. It is necessary to show that L(F) = L{G). Lettmp4C8-347_thumb[2][2][2] tmp4C8-348_thumb[2][2][2]be in the language L(G). Then there exists some derivation in G such that tmp4C8-355_thumb[2][2][2] for a sequence of nonterminalstmp4C8-356_thumb[2][2][2]From the construction of M, it is clear that M(S, a:) containstmp4C8-357_thumb[2][2][2]contains A2,…, and tmp4C8-358_thumb[2][2][2]contains X. Therefore,tmp4C8-359_thumb[2][2][2]since M(S, x) contains X andtmp4C8-360_thumb[2][2][2]

Conversely, iftmp4C8-361_thumb[2][2][2]a derivation of G which simulates the acceptance of x in F can be easily obtained, thereby concluding thattmp4C8-362_thumb[2][2][2]

To illustrate how an NFA is constructed from a regular grammar, consider the grammartmp4C8-363_thumb[2][2][2]wheretmp4C8-364_thumb[2][2][2]the set of productions $ are:

tmp4C8-374_thumb[2][2][2]

and which generates strings of the formtmp4C8-375_thumb[2][2][2]Define

antmp4C8-376_thumb[2][2][2] and M is given by

tmp4C8-379_thumb[2][2][2]

Then F is a nondeterministic finite-state acceptor which accepts the language represented by the regular grammar G.

Just as a nondeterministic finite-state acceptor can be constructed from a regular grammar in a straightforward manner, a regular grammar also can be derived from a deterministic finite-state acceptor in a simple way. The following theorem, like the previous theorem, furnishes the procedure needed to construct a regular grammar.

Theorem 4-4 There exists a regular grammartmp4C8-380_thumb[2][2][2]which produces the language accepted by a given deterministic finite-state acceptor tmp4C8-381_thumb[2][2][2]

Proof Define the regular grammar G with the states of K being the nonterminals of G. The starting symbol of G is S (the starting state of F), and the set of productions are defined as:

tmp4C8-384_thumb[2][2][2]

It must be shown thattmp4C8-385_thumb[2][2][2]if and only iftmp4C8-386_thumb[2][2][2]In the case oftmp4C8-387_thumb[2][2][2]add the productiontmp4C8-388_thumb[2][2][2]

Lettmp4C8-389_thumb[2][2][2]Then there is the set of transitions:

tmp4C8-395_thumb[2][2][2]

where A„ is a final state of F. Thus G contains the productions

tmp4C8-396_thumb[2][2][2]

and the grammar G can produce strings accepted by F.

Conversely, iftmp4C8-397_thumb[2][2][2], then an acceptance of x in F which simulates a derivation in G can be easily obtained, thereby concluding thattmp4C8-398_thumb[2][2][2]

To illustrate the conversion of a deterministic finite-state acceptor to a regular grammar, consider the acceptortmp4C8-399_thumb[2][2][2]where the mapping function is given in Table 4-9. This DFA accepts strings which have an even number of Os and Is. Using the method for constructing a regular grammar as given in Theorem 4-4, the grammar G is defined as G = ({5, A, B, C}, {0,1} S, where <E> is defined as

tmp4C8-403_thumb[2][2][2]

Table 4-9

State

Input

0

1

S

B

A

A

C

S

B

S

C

C

A

B

Because S is a final state, wherever a production of the form a -* {iS occurs, the production a /? is also included in The production 5 -» e is also in $ because SeZin the DFA.

Because nondeterministic finite-state acceptors are equivalent to deterministic acceptors, the equivalence of finite-state acceptors to regular grammars has been established by Theorems 4-3 and 4-4. Thus regular grammars may be used to represent scanners which are implemented as finite-state acceptors. However, regular expressions are more easily used to depict scanners, and demonstrating the equivalence of regular expressions to finite-state acceptors is the topic of the next section.

EQUIVALENCE OF REGULAR EXPRESSIONS AND FINITE-STATE ACCEPTORS

Regular expressions, which were introduced in Sec. 4-3, are often used as a convenient way of describing scanners. The importance of this will become evident in the next section when a scanner generator which uses regular expressions as input is presented. This section details the methods for generating a finite-state acceptor from a regular expression or turning a finite-state acceptor into a regular expression. In so doing, it will be shown that regular expressions and finite-state acceptors are equivalent. After recalling the theories of the previous section, it then can be seen that the languages which can be derived from regular grammars and regular expressions are the same.

It is conceptually easier to understand how a regular expression can be converted to a finite-state acceptor by using a transition diagram. Before proceeding, however, it is useful to show how the three regular expressions e, </>, and p, where p e Vr, are represented in a transition diagram. Figure 4-10 illustrates the finite-state diagrams associated with each of these expressions. The expression e is drawn in Fig. 4-10a and indicates that no state transition occurs on empty input (e-transitions will be included later). Figure 4-106 indicates that the regular expression does not accept any input string, including e, and the transition that occurs with the regular expression p is exhibited in Fig. 4-10c.

With the transition diagrams of Fig. 4-10 representing the most simple regular expressions, more complex regular expressions can be developed using the operations of alternation, concatenation, and closure.

Transition diagrams for the regular expressions e, <#>, and p.

Figure 4-10 Transition diagrams for the regular expressions e, <#>, and p.

Figure 4-11 illustrates how two regular expressions, and r2, may be represented by an NFA with e-transitions when one of the operations is applied. In each case, a new starting state 5 and a new final state Z have been created, both of which are states separate from the finite-state acceptors representing rj and r2. An arc leading into an expression r1 or r2 represents a transition to the starting state of the FSA representing that expression, while the arc leading out of the FSA for an expression depicts a transition from the final state of the acceptor to the next state.

The transition diagram for the expression rx\r2 is given in Fig. 4-1 la. As the acceptor must accept one of rx and r2, e-transitions from the starting state to the FSAs for rx and r2 are drawn. Concatenation, which is illustrated in Fig. 4-11 b for the expression rxr2, makes an e-transition from rx to r2 necessary as well as one from the starting state to rl and r2 to the final state. Figure 4-1 lc depicts the FSA for the expression {r,}. The e-transition from the final state of rx to its starting state permits more than one repetition of rx by the finite-state acceptor. Zero occurrences of the expression are allowed with the e-transition from the starting state S to the final state Z.

Transition diagrams for the expressions

Figure 4-11 Transition diagrams for the expressions tmp4C8-409_thumb[2][2][2]

To demonstrate how the rules for constructing an NFA with e-transitions from a regular expression may be applied, the regular expression 0{1 [23} is used. Figure 4-12 illustrates the steps involved in creating the acceptor from this expression. By rewriting the expression in a fully parenthesized form, the order of evaluation becomes more evident and the finite-state acceptor can be created in a step-by-step manner starting with the operators to be evaluated first. Hence the expression can be written as 0{1|(23)}. With the terminal symbols of the expression being the set {0,1,2,3}, the first operation to be performed is the concatenation of the two subexpressions 2 and 3.

Creation of an NFA with E-transitions.

Figure 4-12 Creation of an NFA with E-transitions.

The finite-state acceptor for the resulting expression 23 is created as shown in Fig. 4-12a. Then the alternation operator is applied to the two subexpressions 1 and 23, resulting in the finite-state acceptor of Fig. 4-126. Figure 4-12c gives the finite-state acceptor for the expression (1|(23)} once closure has been applied to the preceding expression. Finally, the zero concatenated with the expression of Fig. 4-12c gives the entire regular expression, and the finite-state acceptor which is produced is given in Fig. 4-12^. For simplicity, several e-transitions in Fig. 4-12a were eliminated.

Theorem 4-5 Given a regular expression R, there exists an NFA with e-transitions, F, which accepts the language generated by R.

Proof Only an outline of the proof is given to sketch how an NFA with e-transitions is constructed.

The proof is done by induction on each of the three operators found in regular expressions. In the case of alternation in which the FSAs for the two subexpressions rx and r2 aretmp4C8-412_thumb[2][2][2] tmp4C8-413_thumb[2][2][2], respectively, a new FSAtmp4C8-414_thumb[2][2][2]is defined as follows:tmp4C8-415_thumb[2][2][2]> and the setstmp4C8-416_thumb[2][2][2]are all disjoint. Then the mapping function is defined astmp4C8-417_thumb[2][2][2]with the following mappings also given:

tmp4C8-418_thumb[2][2][2]

For concatenation, the new FSA is defined astmp4C8-419_thumb[2][2][2]where K and VT are the same as with alternation, and the mapping function is defined astmp4C8-420_thumb[2][2][2]with the mappings

tmp4C8-421_thumb[2][2][2] also included withtmp4C8-422_thumb[2][2][2]Closure requires a new acceptortmp4C8-423_thumb[2][2][2] tmp4C8-424_thumb[2][2][2]which is created from the original acceptortmp4C8-425_thumb[2][2][2]

tmp4C8-426_thumb[2][2][2]has the same mappings as M but with the additions

tmp4C8-427_thumb[2][2][2]

It is shown by induction that with the application of an operator to the regular expressions as just described, the newly created finite-state acceptor accepts the same language as is generated by the expression resulting from the application of the operator.

Using the regular expression 0(11 (23)} from the previous example, an FSA can be derived astmp4C8-428_thumb[2][2][2]where the mapping function is defined in Table 4-12. This NFA with e-transitions is created by the application of the rules of construction given in the preceding proof to the expression starting with the most nested operators, as was done when transition diagrams were used in the previous example.

Table 4-12

State

Input

0

1

2

3

e

tmp4C8-447 tmp4C8-448 tmp4C8-449 tmp4C8-450 tmp4C8-451 tmp4C8-452
tmp4C8-453 tmp4C8-454 tmp4C8-455 tmp4C8-456 tmp4C8-457 tmp4C8-458
tmp4C8-459 tmp4C8-460 tmp4C8-461 tmp4C8-462 tmp4C8-463 tmp4C8-464
tmp4C8-465 tmp4C8-466 tmp4C8-467 tmp4C8-468 tmp4C8-469 tmp4C8-470
tmp4C8-471 tmp4C8-472 tmp4C8-473 tmp4C8-474 tmp4C8-475 tmp4C8-476
tmp4C8-477 tmp4C8-478 tmp4C8-479 tmp4C8-480 tmp4C8-481 tmp4C8-482
tmp4C8-483 tmp4C8-484 tmp4C8-485 tmp4C8-486 tmp4C8-487 tmp4C8-488
tmp4C8-489 tmp4C8-490 tmp4C8-491 tmp4C8-492 tmp4C8-493 tmp4C8-494
tmp4C8-495 tmp4C8-496 tmp4C8-497 tmp4C8-498 tmp4C8-499 tmp4C8-500
tmp4C8-501 tmp4C8-502 tmp4C8-503 tmp4C8-504 tmp4C8-505 tmp4C8-506

The combination of the previous and following theorem will establish the equivalence of regular expressions and finite-state acceptors. Because FSAs and regular grammars are also equivalent, it can be seen that regular grammars and regular expressions are equivalent.

Theorem 4-6 Given a deterministic finite-state acceptor which accepts the language L, there exists a regular expression which denotes L.

Proof As was the case with the previous theorem, only an outline of the proof is given.

For the DFAtmp4C8-507_thumb[2][2][2]is defined to be the set of all strings x such that ~tmp4C8-508_thumb[2][2][2]and for any prefix of x, say, y, but not including x or e, thentmp4C8-509_thumb[2][2][2]Thustmp4C8-510_thumb[2][2][2]is the set of all strings taking the acceptor from state qt to state ^ without reaching a state numbered greater than k. Thencan be recursively defined as follows:tmp4C8-511_thumb[2][2][2]

tmp4C8-517_thumb[2][2][2]

It is then shown by induction that for all /’, j, and k, a regular expressiontmp4C8-518_thumb[2][2][2] denoting the languagetmp4C8-519_thumb[2][2][2]can be found, starting withtmp4C8-520_thumb[2][2][2]as the basis. Finally, the language L accepted by F can be represented astmp4C8-521_thumb[2][2][2]where the largest state of F is qn. Thus the regular expression is given by where

tmp4C8-522_thumb[2][2][2]are final states of F.tmp4C8-523_thumb[2][2][2]

Figure 4-13

Figure 4-13

The proof of Theorem 4-6 provides the procedure for finding a regular expression which denotes the language accepted by a DFA. Notice that the equation

tmp4C8-531_thumb[2][2][2]

can be represented by the regular-expression equation

tmp4C8-532_thumb[2][2][2]

An example should illustrate how this method can be used to create a regular expression. The transition diagram of a DFA which accepts strings of the form

tmp4C8-533_thumb[2][2][2]is given in Fig. 4-13. The values of alltmp4C8-534_thumb[2][2][2]for this DFA are listed in Table 4-13 after they have been simplified. The first column of the table, with k = 0, requires that an intermediate state cannot be greater than q0. As the smallest state is qx, any expressiontmp4C8-535_thumb[2][2][2]can only generate a single character or an intermediate state would be required in going from state q, to state qr The expression canalso be the empty string if qt and q} are the same state. Thus the expressiontmp4C8-536_thumb[2][2][2]can be either the empty-string expression or the expression a, andtmp4C8-537_thumb[2][2][2]can be the expression b. The expressiontmp4C8-538_thumb[2][2][2]where k is 1, 2, or 3, gives the expression {a}. Only qx can be an intermediate state if qx is also to be the end state. The expression fortmp4C8-539_thumb[2][2][2]is {a}b because only qx can be an intermediate state. For the expressionstmp4C8-540_thumb[2][2][2]the expression {a}b{b} is derived as intermediate states qx and q2 are allowed. The expressionstmp4C8-541_thumb[2][2][2] are the same, since q3 cannot be an intermediate state if q2 is the end state. As the only final state for this DFA is q3, the regular expression for the DFA is

tmp4C8-542_thumb[2][2][2]

Table 4-13

k

0

1

2

3

tmp4C8-553 tmp4C8-554 tmp4C8-555 tmp4C8-556 tmp4C8-557
tmp4C8-558 tmp4C8-559 tmp4C8-560 tmp4C8-561 tmp4C8-562
tmp4C8-563 tmp4C8-564 tmp4C8-565 tmp4C8-566 tmp4C8-567
tmp4C8-568 tmp4C8-569 tmp4C8-570 tmp4C8-571 tmp4C8-572
tmp4C8-573 tmp4C8-574 tmp4C8-575 tmp4C8-576 tmp4C8-577
tmp4C8-578 tmp4C8-579 tmp4C8-580 tmp4C8-581 tmp4C8-582
tmp4C8-583 tmp4C8-584 tmp4C8-585 tmp4C8-586 tmp4C8-587
tmp4C8-588 tmp4C8-589 tmp4C8-590 tmp4C8-591 tmp4C8-592
tmp4C8-593 tmp4C8-594 tmp4C8-595 tmp4C8-596 tmp4C8-597

To illustrate further how some expressions r(* are derived and simplified, the following examples are given:

tmp4C8-598_thumb[2][2][2]

This section has shown that regular expressions and finite-state acceptors both represent the same class of languages. Methods of converting from one form to the other also were described. The next section uses regular expressions to describe scanners for a scanner generator.

Next post:

Previous post: