The problem of XML data exchange - Foundations of Data Exchange

Database Reference

In-Depth Information

Schema languages for XML

Relational schemas essentially specify a list of attributes (columns) for every relation. In

the case of XML one also needs to specify a set of attributes for each label. This, however,

is insufficient, as schemas must describe the hierarchical structure of XML documents too.

To see how this can be done, let us revisit again the example in Figure 10.1 . If we look at

the root and read the labels of its children, from left to right, then we get a string country

country . If we look at the Scotland node and read the labels of its children, then we see a

string ruler ruler ruler ruler . It is a common approach to describe schemas in terms

of the allowed strings obtained in this way. The simplest such schemas are Document

Type Definitions ,or DTDs . They put restrictions on strings of labels in terms of regular

languages they belong to.

For example, the tree from Figure 10.1 conforms to the DTD below:

country ∗

europe

→

country :@ name

ruler ∗

country

→

ruler :@ name

The first column restricts regular expressions of strings of labels. It says that labels of

children of a europe node must come from the language country ∗ , i.e., be a sequence

of country labels, and that labels of children of a country node must come from the

language ruler ∗ . The right column gives a list of attributes: in this case, the attribute

@name is defined for country and ruler nodes. Formally, DTDs are defined as follows.

Definition 10.4 (DTDs) A DTD D over an alphabet

with a distinguished symbol r (for

the root) and a set of attributes Att is a triple ( P D , A D , r ) where:

• P D is a mapping from

to regular expressions over

Γ −{ r }

, which one typically writes

as productions

→ e if P D ( )= e .

2 Att

•

A D is a mapping A D :

Γ →

that assigns a (possibly empty) set of attributes to each

element type.

The size of D , written as

, is the total length of all used regular expressions, plus

Att

Leaf nodes do not have any children; for instance, ruler nodes are such. Technically,

it means that they have an empty string of children. For them, DTDs will have rules like

ruler → ε

; however, we usually omit them when we describe DTDs. So, if there is no

rule of the form

We always assume, for notational convenience, that attributes of a node (if there are

several of them) come in some order. This is the same as in the relational case: attributes

in tuples come in some order so we can write R ( a 1 ,..., a n ). Likewise, we shall describe an

-labeled tree node with n attributes as ( a 1 ,..., a n ).

Atree T conforms to a DTD D =( P D , A D , r ), written as T

→ e in the description of a DTD, we assume that

→ ε

= D ,if:

1. its root is labeled r ;

2. for each node labeled , the labels of its children, read left-to-right, form a string in the

language of P D ( );and

Foundations of Data Exchange

Search WWH ::

Custom Search

Home