Working with Archive Files - Beginning Java 8 Language Features

Java Reference

In-Depth Information

In lossy data compression, you lose some of the data during the compression process and you will not be able to

recover the original data fully when you decompress the compressed data. Lossy data compression is acceptable in

some situations, such as viewing pictures, audios, and videos, where the audience will not see a noticeable difference

when they use the decompressed data. Compared to the lossless data compression, lossy data compression achieves

a higher compression ratio at the cost of the lower data quality. Examples of lossy data compression algorithms are

Discrete Cosine Transform (DCT), A-Law Compander, Mu-Law Compander, Vector Quantization, etc.

DEFLATE is a lossless data compression algorithm, which is used for compressing data in ZIP and GZIP file

formats. GZIP is an abbreviation for GNU ZIP. GNU is a recursive acronym for GNU's Not Unix. The ZIP file format is

used for data compression and file archival. A file archival is the process of combining multiple files into one file for

convenience of storage. Typically, you compress multiple files and put them together in an archive file.

You may have worked with files with an extension of .zip . A ZIP file uses the ZIP file format. It combines multiple

files into one .zip file by, optionally, compressing them.

If you are a UNIX user, you must have worked with a .tar or .tar.gz file. Typically, on UNIX, you use a two-step

process to create a compressed archive file. First, you combine multiple files into a .tar archive file using the tar file

format (tar stands for T ape Ar chive), you compress that archive file using the GZIP file format to get a .tar.gz or

.tgz file. A .tar.gz or .tgz file is also called a tarball. A tarball is more compressed as compared to a zip file. A zip file

compresses multiple files separately and archives them. A tarball archives the multiple files first and then compresses

them. Because a tarball compresses the combined files together, it takes advantage of data repetition among all files

during compression, resulting in a better compression than a zip file.

ZLIB is a general-purpose lossless data compression library. It is free and not covered by any patents. Java

provides support for data compression using the ZLIB library. Deflater and Inflater are two classes in the

java.util.zip package that support general-purpose data compression/decompression functionality in Java using

the ZLIB library. Java provides classes to support ZIP and GZIP file formats. It also supports another file format called

the jar file format, which is a variation of the ZIP file format. I will discuss examples of the file formats supported by

Java in the next few sections.

Checksum

A checksum is a number that is computed by applying an algorithm on a stream of bytes. Typically, it is used when

data is transmitted across the network to check for errors during data transmission. The sender computes a checksum

for a packet of data and sends that checksum with the packet to the receiver. The receiver computes the checksum

for the packet of data it receives and compares it with the checksum it received from the sender. If the two match, the

receiver may assume that there were no errors during the data transmission. The sender and the receiver must agree

to compute the checksum for the data by applying the same algorithm. Otherwise, the checksum will not match.

Using a checksum is not a data security measure to authenticate the data. It is used as an error-detection method.

A hacker can alter some bits of the data and you may still get the same checksum as for the original data.

Let's discuss an algorithm to compute a checksum. The algorithm is called Adler-32 after its inventor Mark

Adler. Its name has the number 32 in it because it computes a checksum by computing two 16-bit checksums and

concatenating them into a 32-bit integer. Let's call the two 16-bit checksums A and B, and the final checksum C. A is

the sum of all bytes plus one in the data. B is the sum of individual values of A from each step. In the beginning, A is

set to 1 and B is set to 0. A and B are computed based on modulus 65521. That is, if the value of A or B exceeds 65521,

their values become their current values modulo 65521. The final checksum is computed as follows:

C = B * 65536 + A

The final checksum is computed by concatenating the 16-bit B and A values. You need to multiply the value of B

by 65536 and add the value of A to it to get the decimal value of that 32-bit final checksum number.

Let's apply the Adler-32 checksum algorithm to compute a checksum for a string HELLO , as shown in Table 9-1 .

Beginning Java 8 Language Features

Search WWH ::

Custom Search

Home