Java Reference
In-Depth Information
Modern programming languages often contain a grammar in their specifi-
cation as a guide to those who teach, study, or use the language. Based on the
analysis discussed in this chapter, such grammars can also participate in the
automatic construction of syntax-checking parsers. Programming language
rules that are not easily expressed by grammars are enforced during the
se-
mantic analyses
discussed in Chapters 7, 8, and 9. In Chapter 2, we discuss
the rudiments of
context-free grammars
(CFGs) and define a simple language
using a CFGs. Here, we formalize the definition and notation for CFGss and
present algorithms that analyze such grammars in preparation for the parsing
techniques covered in Chapters 5 and 6.
Formally, a
language
is a set of finite-length strings over a finite alphabet.
Because most interesting languages are infinite sets, we cannot define such
languages by enumerating their elements. A
context-free grammar
(CFG) is
a compact, finite representation of a language, defined by the following four
components:
•
Afinite
terminal alphabet
. This is the set of tokens produced by the
scanner. We always augment this set with the token $, which signifies
end-of-input.
Σ
•
Afinite
nonterminal alphabet
N
. Symbols in this alphabet are
variables
of the grammar.
•
A
start symbol
S
∈
N
that initiates all
derivations
. S is also called the
goal symbol
.
•
A finite set of productions
P
(sometimes called
rewriting rules
)ofthe
form A
→X
1
...X
m
,whereA
∈
N
,
X
i
∈
N
∪Σ
,1
≤
i
≤
m
,and
m
≥
0. The
only valid production with
m
=
0isoftheformA
→λ
,where
λ
denotes
the
empty string
.
These components are often expressed as G
S ), which is the formal
definition of a CFG. The terminal and nonterminal alphabets must be disjoint
(i.e.,
=
(
N
,Σ,
P
,
). The
vocabulary
V
of a CFG is the set of terminal and
nonterminal symbols (i.e.,
V
=Σ∪
N
).
A CFG is essentially a recipe for creating strings. Starting with S,non-
terminals are rewritten using the grammar's productions until only terminals
remain. A rewrite using the production A
Σ∩
N
= ∅
→α
replaces the nonterminal A with
the vocabulary symbols in
α
. As a special case, a rewrite using the production
A
→λ
causes A to be erased. Each rewrite is a step in a
derivation
of the