Database Reference
In-Depth Information
Schema languages for XML
Relational schemas essentially specify a list of attributes (columns) for every relation. In
the case of XML one also needs to specify a set of attributes for each label. This, however,
is insufficient, as schemas must describe the hierarchical structure of XML documents too.
To see how this can be done, let us revisit again the example in
Figure 10.1
. If we look at
the root and read the labels of its children, from left to right, then we get a string
country
country
. If we look at the
Scotland
node and read the labels of its children, then we see a
string
ruler ruler ruler ruler
. It is a common approach to describe schemas in terms
of the allowed strings obtained in this way. The simplest such schemas are
Document
Type Definitions
,or
DTDs
. They put restrictions on strings of labels in terms of regular
languages they belong to.
For example, the tree from
Figure 10.1
conforms to the DTD below:
country
∗
europe
→
country
:@
name
ruler
∗
country
→
ruler
:@
name
The first column restricts regular expressions of strings of labels. It says that labels of
children of a
europe
node must come from the language
country
∗
, i.e., be a sequence
of
country
labels, and that labels of children of a
country
node must come from the
language
ruler
∗
. The right column gives a list of attributes: in this case, the attribute
@name
is defined for
country
and
ruler
nodes. Formally, DTDs are defined as follows.
Definition 10.4 (DTDs) A
DTD D
over an alphabet
with a distinguished symbol
r
(for
the root) and a set of attributes
Att
is a triple (
P
D
,
A
D
,
r
) where:
Γ
•
P
D
is a mapping from
Γ
to regular expressions over
Γ
−{
r
}
, which one typically writes
as productions
→
e
if
P
D
(
)=
e
.
2
Att
•
A
D
is a mapping
A
D
:
Γ
→
that assigns a (possibly empty) set of attributes to each
element type.
The size of
D
, written as
D
, is the total length of all used regular expressions, plus
|
Att
|
.
Leaf nodes do not have any children; for instance,
ruler
nodes are such. Technically,
it means that they have an
empty string
of children. For them, DTDs will have rules like
ruler
→
ε
; however, we usually omit them when we describe DTDs. So, if there is no
rule of the form
.
We always assume, for notational convenience, that attributes of a node (if there are
several of them) come in some order. This is the same as in the relational case: attributes
in tuples come in some order so we can write
R
(
a
1
,...,
a
n
). Likewise, we shall describe an
-labeled tree node with
n
attributes as
(
a
1
,...,
a
n
).
Atree
T conforms to a DTD D
=(
P
D
,
A
D
,
r
), written as
T
→
e
in the description of a DTD, we assume that
→
ε
|
=
D
,if:
1. its root is labeled
r
;
2. for each node labeled
, the labels of its children, read left-to-right, form a string in the
language of
P
D
(
);and