Introduction - Introduction to Data Compression


samples per second, 16 bits per sample) requires more than 84 million bits. Downloading

music from a website at these rates would take a long time.

As human activity has a greater and greater impact on our environment, there is an ever-

increasing need for more information about our environment, how it functions, and what we

are doing to it. Various space agencies from around the world, including the European Space

Agency (ESA), the National Aeronautics and Space Administration (NASA), the Canadian

Space Agency (CSA), and the Japan Aerospace Exploration Agency (JAXA), are collaborating

on a program to monitor global change that will generate half a terabyte of data per day when it

is fully operational. New sequencing technology is resulting in ever-increasing database sizes

containing genomic information while new medical scanning technologies could result in the

generation of petabytes¹ of data.

Given the explosive growth of data that needs to be transmitted and stored, why not focus

on developing better transmission and storage technologies? This is happening, but it is

not enough. There have been significant advances that permit larger and larger volumes of

information to be stored and transmitted without using compression, including CD-ROMs,

optical fibers, Asymmetric Digital Subscriber Lines (ADSL), and cable modems. However,

while it is true that both storage and transmission capacities are steadily increasing with new

technological innovations, as a corollary to Parkinson's First Law,² it seems that the need

for mass storage and transmission increases at least twice as fast as storage and transmission

capacities improve. Then there are situations in which capacity has not increased significantly.

For example, the amount of information we can transmit over the airwaves will always be

limited by the characteristics of the atmosphere.

An early example of data compression is Morse code, developed by Samuel Morse in the

mid-19th century. Letters sent by telegraph are encoded with dots and dashes. Morse noticed

that certain letters occurred more often than others. In order to reduce the average time required

to send a message, he assigned shorter sequences to letters that occur more frequently, such as

e (·) and a (·−), and longer sequences to letters that occur less frequently, such as q (−−·−)

and j (·−−−). This idea of using shorter codes for more frequently occurring characters is

used in Huffman coding, which we will describe in Chapter 3.
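
The saving from assigning shorter sequences to more frequent letters can be illustrated with a minimal sketch. It uses only the dot and dash sequences for e, a, q, and j; the sample message, the helper name encoded_length, and the fixed-length baseline are assumptions made for illustration, and the count ignores the pauses between letters and the longer duration of a dash.

```python
# Dot/dash sequences for the four letters named in the text ('.' = dot, '-' = dash).
MORSE = {"e": ".", "a": ".-", "q": "--.-", "j": ".---"}


def encoded_length(message, code):
    """Return the total number of dots and dashes needed to send `message`."""
    return sum(len(code[letter]) for letter in message)


# Hypothetical message (an assumption for illustration) in which the
# frequent letters e and a dominate and q and j appear once each.
message = "eeeeeeaaaqj"

variable_length = encoded_length(message, MORSE)

# Baseline for comparison: give every letter a sequence as long as the
# longest one in the table (4 symbols), ignoring letter frequency.
fixed_length = max(len(seq) for seq in MORSE.values()) * len(message)

print(variable_length, fixed_length)  # prints "20 44"
```

For this message the variable-length assignment needs 20 symbols where the fixed-length baseline needs 44. The gain depends entirely on how skewed the letter frequencies are: a message made up mostly of q and j would see almost no benefit.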

Where Morse code uses the frequency of occurrence of single characters, a widely used

form of Braille code, which was also developed in the mid-19th century, uses the frequency

of occurrence of words to provide compression [1]. In Braille coding, 2 × 3 arrays of dots are

used to represent text. Different letters can be represented depending on whether the dots are

raised or flat. In Grade 1 Braille, each array of six dots represents a single character. However,

given six dots with two positions for each dot, we can obtain 2⁶, or 64, different combinations.

If we use 26 of these for the different letters, we have 38 combinations left. In Grade 2 Braille,

some of these leftover combinations are used to represent words that occur frequently, such

as “and” and “for.” One of the combinations is used as a special symbol indicating that the

symbol that follows is a word and not a character, thus allowing a large number of words to be

¹ mega: 10⁶, giga: 10⁹, tera: 10¹², peta: 10¹⁵, exa: 10¹⁸, zetta: 10²¹, yotta: 10²⁴.

² Parkinson's First Law: “Work expands so as to fill the time available,” in Parkinson's Law and Other Studies in

Administration, by Cyril Northcote Parkinson, Ballantine Books, New York, 1957.
