Java Reference
In-Depth Information
In lossy data compression, you lose some of the data during the compression process and you will not be able to
recover the original data fully when you decompress the compressed data. Lossy data compression is acceptable in
some situations, such as viewing pictures, audios, and videos, where the audience will not see a noticeable difference
when they use the decompressed data. Compared to the lossless data compression, lossy data compression achieves
a higher compression ratio at the cost of the lower data quality. Examples of lossy data compression algorithms are
Discrete Cosine Transform (DCT), A-Law Compander, Mu-Law Compander, Vector Quantization, etc.
DEFLATE is a lossless data compression algorithm, which is used for compressing data in ZIP and GZIP file
formats. GZIP is an abbreviation for GNU ZIP. GNU is a recursive acronym for GNU's Not Unix. The ZIP file format is
used for data compression and file archival. A file archival is the process of combining multiple files into one file for
convenience of storage. Typically, you compress multiple files and put them together in an archive file.
You may have worked with files with an extension of .zip . A ZIP file uses the ZIP file format. It combines multiple
files into one .zip file by, optionally, compressing them.
If you are a UNIX user, you must have worked with a .tar or .tar.gz file. Typically, on UNIX, you use a two-step
process to create a compressed archive file. First, you combine multiple files into a .tar archive file using the tar file
format (tar stands for T ape Ar chive), you compress that archive file using the GZIP file format to get a .tar.gz or
.tgz file. A .tar.gz or .tgz file is also called a tarball. A tarball is more compressed as compared to a zip file. A zip file
compresses multiple files separately and archives them. A tarball archives the multiple files first and then compresses
them. Because a tarball compresses the combined files together, it takes advantage of data repetition among all files
during compression, resulting in a better compression than a zip file.
ZLIB is a general-purpose lossless data compression library. It is free and not covered by any patents. Java
provides support for data compression using the ZLIB library. Deflater and Inflater are two classes in the
java.util.zip package that support general-purpose data compression/decompression functionality in Java using
the ZLIB library. Java provides classes to support ZIP and GZIP file formats. It also supports another file format called
the jar file format, which is a variation of the ZIP file format. I will discuss examples of the file formats supported by
Java in the next few sections.
Checksum
A checksum is a number that is computed by applying an algorithm on a stream of bytes. Typically, it is used when
data is transmitted across the network to check for errors during data transmission. The sender computes a checksum
for a packet of data and sends that checksum with the packet to the receiver. The receiver computes the checksum
for the packet of data it receives and compares it with the checksum it received from the sender. If the two match, the
receiver may assume that there were no errors during the data transmission. The sender and the receiver must agree
to compute the checksum for the data by applying the same algorithm. Otherwise, the checksum will not match.
Using a checksum is not a data security measure to authenticate the data. It is used as an error-detection method.
A hacker can alter some bits of the data and you may still get the same checksum as for the original data.
Let's discuss an algorithm to compute a checksum. The algorithm is called Adler-32 after its inventor Mark
Adler. Its name has the number 32 in it because it computes a checksum by computing two 16-bit checksums and
concatenating them into a 32-bit integer. Let's call the two 16-bit checksums A and B, and the final checksum C. A is
the sum of all bytes plus one in the data. B is the sum of individual values of A from each step. In the beginning, A is
set to 1 and B is set to 0. A and B are computed based on modulus 65521. That is, if the value of A or B exceeds 65521,
their values become their current values modulo 65521. The final checksum is computed as follows:
C = B * 65536 + A
The final checksum is computed by concatenating the 16-bit B and A values. You need to multiply the value of B
by 65536 and add the value of A to it to get the decimal value of that 32-bit final checksum number.
Let's apply the Adler-32 checksum algorithm to compute a checksum for a string HELLO , as shown in Table 9-1 .
 
Search WWH ::




Custom Search