Database Reference
In-Depth Information
2.2 The Idea Behind Sharing Similar Trees
The simplest grammar-based XML compressors are those compressors that share
identical sub-tree structures, such that the compressed grammar represents the
minimal DAG of the XML tree [4].
Approaches like binary DAG compression, that share identical sub-trees T in an
XML document D replace repeated occurrences of T in D by replacing each
occurrence of T in D with a non-terminal N and adding a grammar rule that defines N
to be a non-terminal that represents T.
In Grammar 1, there are four matches for each of the two patterns b(
).
Therefore, these matches can be replaced by the non-terminals B and E respectively,
such that we get the following grammar:
ε
,
ε
) and e(
ε
,
ε
S r(k1(c(d(j(m(
ε
,
ε
),B),E),
ε
), k2(h(g(j(n (
ε
,
ε
),B),E),
ε
),
k3(c(d(j(o(
ε
,
ε
),B),E),
ε
), k4(h(g(j(p (
ε
,
ε
),B),E),
ε
),
ε
)))),
ε
)
B b(
ε
,
ε
)
E e(
ε
,
ε
)
Grammar 2: Grammar corresponding to the binary DAG of the XML tree of Fig. 1
Our approach goes beyond the idea of DAG compression and uses a parameterized
grammar for sharing not only identical sub-trees, but even similar sub-trees. It follows
the idea of grammar-based compression as it was introduced in BPLEX [5].
When looking for similar sub-trees having small differences, we find the three
different patterns shown in Fig. 2(b) in the document tree D of Fig. 1: one pattern
consisting of the nodes c, d, and e, another pattern consisting of the nodes h, g, and e,
and a third pattern consisting of the nodes j and b respectively. For each of these
patterns, there exist several matches in D which are highlighted in Fig. 2(a). Although
the matches of the patterns have identical inner nodes, they cannot be shared in a
DAG because the leaf nodes with labels m, n, o, or p respectively differ from each
other.
Fig. 2 (b) shows the patterns JB(X), CDE(X) and HGE(X), which consist of the
nodes (j and b), (c, d, and e), and (h, g, and e) respectively. The nodes j, d, and g have
a parameter 'X' as first-child.
The compression achieved by replacing the repeated patterns with a non-terminal
can be seen when comparing Grammar 2 with Grammar 3 shown below. We express
the pattern JB(X) of Fig. 2(b) by one grammar rule with the left hand side JB(X),
where the parameter 'X' is being used for referencing the different child nodes m, n,
o, and p of the j-nodes. This grammar rule is being used, e.g. when the term
j(m(
ε
,
ε
),B) occurring in the rule S of Grammar 2 is replaced with the term JB(m(
ε
,
ε
))
occurring in the rule S of Grammar 3. Here, j(m(
ε
,
ε
),B) is called a match , and
JB(m(
ε
,
ε
)) is called a corresponding instantiation of the pattern JB(X).
Search WWH ::




Custom Search