$$H_L \leq H(X) \approx 4.14$$
Hence, instead of 4.7 bits of information per letter, we have around 4.14 bits
of information per letter if we take into account the (statistical) letter frequencies
of the English language. But this is still an overestimate, because the letters are not
independent. For example, in the English language a Q is almost always followed
by a U, and the bigram TH occurs frequently. So one would suspect that a better
statistic for the amount of entropy per letter could be obtained by looking at the
distribution of bigrams (instead of letters). If $X^2$ denotes the random variable
of bigrams in the English language, then we can refine the upper bound for $H_L$:

$$H_L \leq \frac{H(X^2)}{2} \approx 3.56$$
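To make the single-letter figure concrete, here is a minimal Python sketch that computes $H(X)$ from an approximate table of English letter frequencies. The frequency values below are illustrative assumptions; published tables differ slightly, which is why this computation yields about 4.2 bits rather than exactly the 4.14 quoted above.

```python
import math

# Approximate relative frequencies (in percent) of letters in English text.
# These figures are illustrative; published frequency tables differ slightly.
FREQ = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
    's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
    'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
    'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.15, 'x': 0.15,
    'q': 0.10, 'z': 0.07,
}

def entropy(freqs):
    """Shannon entropy (in bits) of a normalized frequency table."""
    total = sum(freqs.values())
    return -sum((f / total) * math.log2(f / total)
                for f in freqs.values() if f > 0)

print(f"H(X)     = {entropy(FREQ):.2f} bits per letter")  # about 4.2
print(f"log2(26) = {math.log2(26):.2f} bits per letter")  # 4.70
```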
This can be continued with trigrams and, more generally, $n$-grams. In the
most general case, the entropy of the language $L$ is defined as follows:

$$H_L = \lim_{n \to \infty} \frac{H(X^n)}{n}$$
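To illustrate what the limit means in practice, the following sketch estimates $H(X^n)/n$ by counting $n$-gram frequencies in a text sample; the file name corpus.txt is a hypothetical placeholder for any large English text. Finite-sample estimates like this are biased low for larger $n$, so the sketch only illustrates the definition and is not a reliable measurement of $H_L$.

```python
import math
from collections import Counter

def ngram_entropy_rate(text, n):
    """Estimate H(X^n)/n in bits per letter from observed n-gram counts."""
    # Keep letters only, ignoring case, punctuation, and whitespace.
    letters = ''.join(c for c in text.lower() if c.isalpha())
    ngrams = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    total = sum(ngrams.values())
    h = -sum((c / total) * math.log2(c / total) for c in ngrams.values())
    return h / n

sample = open('corpus.txt').read()  # hypothetical placeholder corpus
for n in (1, 2, 3):
    print(f"n = {n}: {ngram_entropy_rate(sample, n):.2f} bits per letter")
```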
The exact value of $H_L$ is hard to determine. All statistical investigations show
that

$$1.0 \leq H_L \leq 1.5$$
for the English language (e.g., [6]). So each letter in an English text gives at most
1.5 bits of information. This implies that the English language (like all natural
languages) contains a high degree of redundancy. The redundancy of language $L$,
denoted by $R_L$, is defined as follows:
$$R_L = 1 - \frac{H_L}{\log_2 |\Sigma|}$$
In the case of the English language, we have $H_L \approx 1.25$ and
$\log_2 |\Sigma| = \log_2 26 \approx 4.7$. So the redundancy of the English
language is

$$R_L = 1 - \frac{1.25}{4.7} \approx 0.73$$
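As a quick numeric check, here is a minimal Python computation with the figures above; the value 1.25 is the assumed entropy estimate for $H_L$ quoted in the text.

```python
import math

H_L = 1.25                 # assumed entropy of English, bits per letter (from the text)
log_sigma = math.log2(26)  # maximum entropy of a 26-letter alphabet, about 4.70

R_L = 1 - H_L / log_sigma
print(f"R_L = {R_L:.2f}")  # about 0.73
```

In this information-theoretic sense, roughly three quarters of written English is redundant; an ideal compressor could in principle shrink English text to about a quarter of its length.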