Cryptography Reference
In-Depth Information
An adequate representation of the English language is The Complete Works of William Shakespeare [3]. We
can easily calculate the index of coincidence, ignoring punctuation and spaces, by counting the occurrences of
each character and applying the above formula. In this case, we calculate it to be approximately 0.0639.
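
The counting procedure just described can be sketched in a few lines of Python. This is only an illustration: it assumes the formula referred to above is the usual definition, the sum over all characters c of count(c) × (count(c) − 1), divided by length × (length − 1). The function name index_of_coincidence is hypothetical and not taken from the text.

```python
from collections import Counter


def index_of_coincidence(text: str) -> float:
    """Index of coincidence of the letters A-Z in text (assumed standard formula)."""
    # Keep only ASCII letters, folded to upper case; this drops spaces,
    # punctuation, and digits, as the text suggests.
    letters = [ch.upper() for ch in text if ch.isascii() and ch.isalpha()]
    length = len(letters)
    if length < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    # Number of ordered pairs of equal letters, over all ordered pairs of positions.
    numerator = sum(k * (k - 1) for k in counts.values())
    return numerator / (length * (length - 1))
```

Running it over a large corpus gives values comparable to the ones quoted in this section. On very short inputs the result is too noisy to be meaningful.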
While Shakespeare provides an interesting reference point and is fairly representative of English, it is necessary
to consider the source of the message you are analyzing. For example, if your source text is likely C code,
a better reference might be a large collection of C code, such as the Linux kernel. The Linux 2.6.15.1 kernel has
an I_C ≈ 0.0585. Or, if the text is in Klingon, we can take a sample of Klingon with a few English loan words
(taken from about 156 kilobytes of the Qo'noS Qonos) and find that I_C ≈ 0.0496.
The theoretically perfect I_C arises when all characters occur exactly the same number of times, so that none is
more likely than any other to be repeated. This can be easily calculated. For English, since we have 26 characters
in our Latin-based alphabet, the perfect case is that each character occurs exactly 1/26 of the
time. This means that, in the above equation, we can assume that length = 26 × count(c) for all c.
This gives us the following formula to calculate the theoretically perfect value, which is the lowest I_C a long
text can have. We can assume that the count is n, to make the formula easier to read. The counts become more
precise as we get more and more ciphertext; therefore, we will assume that the amount of ciphertext is
approaching an infinite amount.
\[ I_C = \lim_{n \to \infty} \sum_{c=\mathrm{A}}^{\mathrm{Z}} \frac{n(n-1)}{26n(26n-1)} \]
We can simplify this a little (since we know that each part of the sum is always the same):
\[ I_C = \lim_{n \to \infty} 26 \cdot \frac{n(n-1)}{26n(26n-1)} \]
And we can even simplify a little further:
\[ I_C = \lim_{n \to \infty} \frac{n-1}{26n-1} \]
Most calculus courses teach L'Hôpital's Rule, which tells us that the above limit can be simplified again,
giving our theoretical best:
I_C = 1/26 ≈ 0.03846
This can be seen intuitively by the fact that, as n gets very large, the subtraction of the constant 1 means very
little to the value of the fraction, which is dominated by the n/(26n) part. This simplifies to 1/26.
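
The limit can also be checked numerically. The sketch below assumes the usual definition of the index of coincidence, the sum of count(c) × (count(c) − 1) over all characters divided by length × (length − 1). It evaluates that definition exactly for a text in which each of the 26 letters occurs n times. The helper name flat_ic is hypothetical.

```python
from fractions import Fraction


def flat_ic(n: int, alphabet_size: int = 26) -> Fraction:
    """Exact I_C of a text in which each of alphabet_size letters occurs n times."""
    length = alphabet_size * n
    # Every letter contributes the same term n * (n - 1) to the sum.
    numerator = alphabet_size * n * (n - 1)
    return Fraction(numerator, length * (length - 1))


# The values climb toward 1/26 as n grows.
for n in (1, 10, 1000, 1_000_000):
    print(n, float(flat_ic(n)))
```

Using Fraction keeps the arithmetic exact, so the only rounding happens when the result is printed as a float.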
Note that this technique does not allow us to actually break a cipher. This is simply a tool to provide us more
information about the text with which we are dealing.
1.5.1.3 Other Issues
There are some proposed methods of strengthening basic ciphers (monoalphabetic, polyalphabetic,
transposition, or others). See Reference [5] for some of these examples.
One very simple method is to throw meaningless characters called nulls into the ciphertext. For example, the
character X does not appear very often in texts. Therefore, we could just throw the letter X randomly into the
plaintext before encrypting. This technique isn't terribly difficult to spot: frequency analysis will show a fairly
typical distribution of characters, except for one extra, large spike. Once any suspected nulls
are removed, the analysis should be easier. Another common use of nulls is to remove the spaces from the
plaintext and then insert spaces into the ciphertext in a random, English-like manner.
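
A minimal sketch of the X-as-null idea follows. The function names, the insertion rate, and the fixed seed are all assumptions made for illustration and do not come from the text. The nulls are added to the plaintext, so under a monoalphabetic substitution the spike would show up at whichever ciphertext letter X maps to.

```python
import random
from collections import Counter


def add_nulls(text: str, null: str = "X", rate: float = 0.2, seed: int = 0) -> str:
    """Insert the null character before each position with probability rate."""
    rng = random.Random(seed)  # fixed seed only so the demo is repeatable
    out = []
    for ch in text:
        if rng.random() < rate:
            out.append(null)
        out.append(ch)
    return "".join(out)


def strip_nulls(text: str, null: str = "X") -> str:
    """Remove every occurrence of the suspected null (genuine ones included)."""
    return text.replace(null, "")


plaintext = "ATTACKATDAWNANDHOLDTHEBRIDGEUNTILRELIEVED"
padded = add_nulls(plaintext)
# A frequency count of the padded text shows how prominent the null has become.
print(Counter(padded).most_common(3))
print(strip_nulls(padded) == plaintext)
```

Note that strip_nulls also deletes any legitimate X in the message, which is one reason a rarely used letter is chosen as the null.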