$$H_L \leq H(X) \approx 4.14$$
Hence, instead of 4.7 bits of information per letter, we have around 4.14 bits
of information per letter if we take into account the (statistical) letter frequencies
of the English language. But this is still an overestimate, because the letters are not
independent. For example, in the English language a Q is almost always followed
by a U, and the bigram TH occurs frequently. So one would suspect that a better
statistic for the amount of entropy per letter could be obtained by looking at the
distribution of bigrams (instead of letters). If $X^2$ denotes the random variable
of bigrams in the English language, then we can refine the upper bound for $H_L$:

$$H_L \leq \frac{H(X^2)}{2} \approx 3.56$$
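To make the single-letter figure concrete, here is a minimal Python sketch that computes $H(X)$ from an approximate table of English letter frequencies. The frequency values below are illustrative assumptions; published tables differ slightly, which is why this computation yields about 4.2 bits rather than exactly the 4.14 quoted above.

```python
import math

# Approximate relative frequencies (in percent) of letters in English text.
# These figures are illustrative; published frequency tables differ slightly.
FREQ = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
    's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
    'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
    'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.15, 'x': 0.15,
    'q': 0.10, 'z': 0.07,
}

def entropy(freqs):
    """Shannon entropy (in bits) of a normalized frequency table."""
    total = sum(freqs.values())
    return -sum((f / total) * math.log2(f / total)
                for f in freqs.values() if f > 0)

print(f"H(X)     = {entropy(FREQ):.2f} bits per letter")  # about 4.2
print(f"log2(26) = {math.log2(26):.2f} bits per letter")  # 4.70
```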
This can be continued with trigrams and, more generally, $n$-grams. In the
most general case, the entropy of the language $L$ is defined as follows:

$$H_L = \lim_{n \to \infty} \frac{H(X^n)}{n}$$
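To illustrate what the limit means in practice, the following sketch estimates $H(X^n)/n$ by counting $n$-gram frequencies in a text sample; the file name corpus.txt is a hypothetical placeholder for any large English text. Finite-sample estimates like this are biased low for larger $n$, so the sketch only illustrates the definition and is not a reliable measurement of $H_L$.

```python
import math
from collections import Counter

def ngram_entropy_rate(text, n):
    """Estimate H(X^n)/n in bits per letter from observed n-gram counts."""
    # Keep letters only, ignoring case, punctuation, and whitespace.
    letters = ''.join(c for c in text.lower() if c.isalpha())
    ngrams = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    total = sum(ngrams.values())
    h = -sum((c / total) * math.log2(c / total) for c in ngrams.values())
    return h / n

sample = open('corpus.txt').read()  # hypothetical placeholder corpus
for n in (1, 2, 3):
    print(f"n = {n}: {ngram_entropy_rate(sample, n):.2f} bits per letter")
```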
The exact value of $H_L$ is hard to determine. All statistical investigations show
that

$$1.0 \leq H_L \leq 1.5$$
for the English language (e.g., [6]). So each letter in an English text gives at most
1.5 bits of information. This implies that the English language (like all natural
languages) contains a high degree of redundancy. The redundancy of language $L$,
denoted by $R_L$, is defined as follows:
$$R_L = 1 - \frac{H_L}{\log_2 |\Sigma|}$$
In the case of the English language, we have $H_L \approx 1.25$ and
$\log_2 |\Sigma| = \log_2 26 \approx 4.7$. So the redundancy of the English
language is

$$R_L = 1 - \frac{1.25}{4.7} \approx 0.73$$
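As a quick numeric check, here is a minimal Python computation with the figures above; the value 1.25 is the assumed entropy estimate for $H_L$ quoted in the text.

```python
import math

H_L = 1.25                 # assumed entropy of English, bits per letter (from the text)
log_sigma = math.log2(26)  # maximum entropy of a 26-letter alphabet, about 4.70

R_L = 1 - H_L / log_sigma
print(f"R_L = {R_L:.2f}")  # about 0.73
```

In this information-theoretic sense, roughly three quarters of written English is redundant; an ideal compressor could in principle shrink English text to about a quarter of its length.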