Database Reference
In-Depth Information
generate and output information not already contained in the schema information
(e.g., the chosen alternative for a choice-operator or the number of repetitions for a *-
operator within the DTD). These approaches are queryable and applicable to XML
streams, but they can only be used if schema information is available.
XQzip [11] and the approaches presented in [16] and [4] belong to grammar-based
compression. They compress the data structure of an XML document by combining
identical sub-trees.
An extension of [4] and [11] is the BPLEX algorithm [5]. This approach not only
combines identical sub-trees, but recognizes similar patterns within the XML tree, and
therefore allows a higher degree of compression. The approach presented in this
paper, which is an extension of [17], follows the same idea. But instead of combining
similar structures bottom-up, our approach searches within a given window the most
promising pair to be combined while following one of three possible clustering
strategies. Furthermore, in contrast to [12] and [18], that performs updates by path
isolation only sequentially, our approach allows performing updates in parallel which
takes only a fraction of time.
6 Summary and Conclusions
We have shown how updates can be performed directly on CluX, a clustering-based
compression approach for XML trees, i.e., without the need to decompress the
compressed data in advance. As an XML file compressor, CluX compresses on
average 70% better than the generic compressor gzip and 5% better than the generic
compressor bzip2. CluX compression can be applied to infinite data streams - and in
contrast to gzip or bzip2, path queries and updates can be evaluated directly on the
compressed representation, i.e., without prior decompression. Beyond other clustering
or multiplexing based approaches like e.g. the BPLEX algorithm [12], [5], CluX
offers an update DAG isolation technique that allows to perform several updates in
parallel, and our evaluation has shown that performing 100 updates in parallel takes
significantly less time than performing 100 updates sequentially. Furthermore, our
evaluation on a file with a size of 15 MB has shown that performing the updates
directly on the compressed data with our update algorithm is more than 3 times faster
than decompressing the data first and recompressing it with CluX, and it is more than
4 times faster than the decompression and recompression with bzip2.
We furthermore believe that this technique of performing several updates in
parallel on the compressed data directly is not restricted to CluX, but can be extended
to DAG-based compressors like [5] and to other grammar-based compressors like e.g.
BPLEX [5], the main idea of which is to share similar sub-trees.
References
1. Zhang, N., Kacholia, V., Özsu, M.: A Succinct Physical Storage Scheme for Efficient
Evaluation of Path Queries in XML. In: Proceedings of the 20th International Conference
on Data Engineering, ICDE 2004, Boston, MA, USA, pp. 54-65 (2004)
2. Ng, W., Lam, W., Wood, P., Levene, M.: XCQ: A queriable XML compression system.
Knowl. Inf. Syst., 421-452 (2006)
Search WWH ::




Custom Search