Database Reference
In-Depth Information
PATTERN-BASED APPROACH TO XML SCHEMAS AND INSTANCES
XML Schemas
Schemas for XML data are usually specified by means of XSDL (
XML Schema Definition Language
)
(XML Schema, 2009) or DTD (
Document Type Definition
) (Martens et al., 2007). The aim of such a
schema is to control well-formedness and validity of XML documents. We assume that the structure of
XML documents conform to their DTDs and that attributes are represented by so called
terminal ele-
ments
labeled with
terminal labels
of
Text type
(we ignore mixed elements).
Let
L
be a set of labels,
Ter
∧
L
be a set of terminal labels, and
top
∧
L
-
Ter
be the outermost (top)
label.
Definition 1.
(
DTD
) A tuple
D
= (
top
,
L
,
Ter
, σ) is a DTD, if σ is a function assigning regular expres-
sions over
L
- {
top
} to non-terminal labels, and
Text
to terminal labels, i.e.
•
σ:
L
-
Ter
→
Reg
•
σ:
Ter
→ {
Text
},
the set
Reg
of regular expressions is defined by the grammar:
e
::=
l
| (
e
) |
e
? |
e*
|
e
+
|
e
+
e
|
e e
.
Example 1.
In Figure 1 there are graphical representations of the following DTDs (
D
a
is a recursive
DTD, while D
1
is non-recursive):
D
1
= (
pubs
, {
pubs, pub, title, year, author, name, university
}, {
title, year,
name, university
}, σ),
where
σ(
pubs
) =
pub
*, σ(
pub
) =
title year
?
author
+, σ(
author
) =
name university
?
σ(
title
) = σ(
year
) = σ(
name
) = σ(
university
) =
Text
.
D
a
= (
parts
, {
parts
,
part
,
pid
}, {
pid
}, σ),
where:
σ(
parts
) =
part
*, σ(
part
) =
pid part
*, σ(
pid
) =
Text
.
In this chapter, we assume that all XML documents are valid, so we will not be interested in checking
their validity against DTDs. Instead, we are interested in such a formal description of the structures of
XML documents that would be convenient for defining mappings and transforming data in XML data
integration processes.
Since every XML document is a finite tree, thus its structure can be specified by a
tree pattern
(Xu
& Ozsoyoglu, 2005). For documents having recursive DTDs their tree patterns are restricted to finite
depths. The notion of tree-patterns can be used to define
tree-pattern formulas
(Arenas & Libkin, 2005).
A tree-pattern formula arises from a tree pattern by assigning variables to terminal labels (i.e. to paths
starting in the root and ending with this terminal label). In Figure 1 the tree-pattern formulas,
S
1
and
S
a
,