Java Reference
In-Depth Information
file compression
12.1
The ASCII character set consists of roughly 100 printable characters. To dis-
tinguish these characters, bits are required. Seven bits allow
the representation of 128 characters, so the ASCII character set adds some
other “unprintable” characters. An eighth bit is added to allow parity checks.
The important point, however, is that if the size of the character set is C, then
bits are needed in a standard fixed-length encoding.
Suppose that you have a file that contains only the characters a, e, i, s, and
t, blank spaces ( sp ), and newlines ( nl ). Suppose further that the file has 10 a 's,
15 e 's, 12 i 's, 3 s 's, 4 t 's, 13 blanks, and 1 newline. As Figure 12.1 shows, rep-
resenting this file requires 174 bits because there are 58 characters and each
character requires 3 bits.
A standard encod-
ing of C characters
uses
log
100
=
7
log C
bits.
log
C
In real life, files can be quite large. Many very large files are the output of
some program, and there is usually a big disparity between the most fre-
quently and least frequently used characters. For instance, many large data
files have an inordinately large number of digits, blanks, and newlines but few
q 's and x 's.
In many situations reducing the size of a file is desirable. For instance, disk
space is precious on virtually every machine, so decreasing the amount of space
required for files increases the effective capacity of the disk. When data are being
transmitted across phone lines by a modem, the effective rate of transmission is
increased if the amount of data transmitted can be reduced. Reducing the number
of bits required for data representation is called compression, which actually con-
sists of two phases: the encoding phase (compression) and the decoding phase
(uncompression). A simple strategy discussed in this chapter achieves 25 percent
savings on some large files and as much as 50 or 60 percent savings on some large
data files. Extensions provide somewhat better compression.
Reducing the num-
ber of bits required
for data represen-
tation is called
compression , which
actually consists of
two phases: the
encoding phase
(compressing) and
the decoding phase
(uncompressing).
figure 12.1
A standard coding
scheme
Character
Code
Frequency
Total Bits
a
000
10
30
e
001
15
45
i
010
12
36
s
011
3
9
t
100
4
12
sp
101
13
39
nl
110
1
3
Total
174
 
 
 
Search WWH ::




Custom Search