Updates on Grammar-Compressed XML Data - Advances in Databases

Database Reference

In-Depth Information

generate and output information not already contained in the schema information

(e.g., the chosen alternative for a choice-operator or the number of repetitions for a *-

operator within the DTD). These approaches are queryable and applicable to XML

streams, but they can only be used if schema information is available.

XQzip [11] and the approaches presented in [16] and [4] belong to grammar-based

compression. They compress the data structure of an XML document by combining

identical sub-trees.

An extension of [4] and [11] is the BPLEX algorithm [5]. This approach not only

combines identical sub-trees, but recognizes similar patterns within the XML tree, and

therefore allows a higher degree of compression. The approach presented in this

paper, which is an extension of [17], follows the same idea. But instead of combining

similar structures bottom-up, our approach searches within a given window the most

promising pair to be combined while following one of three possible clustering

strategies. Furthermore, in contrast to [12] and [18], that performs updates by path

isolation only sequentially, our approach allows performing updates in parallel which

takes only a fraction of time.

6 Summary and Conclusions

We have shown how updates can be performed directly on CluX, a clustering-based

compression approach for XML trees, i.e., without the need to decompress the

compressed data in advance. As an XML file compressor, CluX compresses on

average 70% better than the generic compressor gzip and 5% better than the generic

compressor bzip2. CluX compression can be applied to infinite data streams - and in

contrast to gzip or bzip2, path queries and updates can be evaluated directly on the

compressed representation, i.e., without prior decompression. Beyond other clustering

or multiplexing based approaches like e.g. the BPLEX algorithm [12], [5], CluX

offers an update DAG isolation technique that allows to perform several updates in

parallel, and our evaluation has shown that performing 100 updates in parallel takes

significantly less time than performing 100 updates sequentially. Furthermore, our

evaluation on a file with a size of 15 MB has shown that performing the updates

directly on the compressed data with our update algorithm is more than 3 times faster

than decompressing the data first and recompressing it with CluX, and it is more than

4 times faster than the decompression and recompression with bzip2.

We furthermore believe that this technique of performing several updates in

parallel on the compressed data directly is not restricted to CluX, but can be extended

to DAG-based compressors like [5] and to other grammar-based compressors like e.g.

BPLEX [5], the main idea of which is to share similar sub-trees.

References

1. Zhang, N., Kacholia, V., Özsu, M.: A Succinct Physical Storage Scheme for Efficient

Evaluation of Path Queries in XML. In: Proceedings of the 20th International Conference

on Data Engineering, ICDE 2004, Boston, MA, USA, pp. 54-65 (2004)

2. Ng, W., Lam, W., Wood, P., Levene, M.: XCQ: A queriable XML compression system.

Knowl. Inf. Syst., 421-452 (2006)

Search WWH ::

Custom Search

Home