Title: 15C: Caesar's JVTIVK JRCRU IVTZGV
Content:
Exercises 15A, 15B, and 15C can be completed in any order.
Cryptography is the art and science of hiding the meaning of information, in a way that only some people can see it. In this lesson we will introduce one of the simplest cryptographic methods, the Caesar cipher (also known as a shift cipher), and you will to write a program to break it. You will design every aspect of your solution (unlike lesson 15A, which we broke into sub-parts).
The Caesar cipher works by replacing each letter of the alphabet by another letter. To be precise, whenever you want to encode some text you need to pick a shift value S, which is a number between 0 and 25. Then, you replace each letter in the text with the letter which is S positions later in the alphabet, circling around to the start after you reach Z at the end of the alphabet.
Example
Suppose we want to encode the secret message
JOIN ME AT EIGHT BY THE ZOO
using the shift value S=2. The encryption rule says each letter is replaced by the one which occurs 2 positions later in the alphabet. For example, since the alphabet is ABCDEFGHIJKL..., the first letter J
will be replaced by the letter L
. Continuing, the O
is replaced by Q
, the I
is replaced by K
, et cetera. To encode the letter Y, we have to circle back to the start: after Y comes Z, then A, so Y
is replaced by A
. Likewise Z
is replaced by B
. So the encoded version of our secret message is:
LQKP OG CV GKIJV DA VJG BQQ
If some spy saw this message, it would not be obvious to them what your message was about.
SPY CODER
with the shift value S=5? (Use uppercase.) You can write a program in the console to compute this, if you like — it might be useful later.Decoding
Once your friend gets the message, if they know the secret shift value S then it is easy for them to decipher the message: each letter is replaced by the one appearing S places earlier in the alphabet. For example, they would look at L
, step back two positions in the alphabet and find J
, which they now know is the first letter of your secret message. Again, they have to treat the alphabet as cyclic, assuming that Z
comes before A
.
HUD
, and the shift value is S=6, what was the original message? (Use uppercase.)Working for the Bad Guys
You have been hired by a spy to help decipher a message encrypted using a shift cipher. Sadly, the spy does not know the value of S. You will use statistics to write a program that has a good chance to automatically find the right value of S.
Our method will be to use letter frequency analysis. In English, some letters are very common (like E) and some are very rare (like J). Here is a table of frequencies, based on counting the occurrences in a large body of text.
A | B | C | D | E | F | G | H | I |
---|---|---|---|---|---|---|---|---|
.0817 | .0149 | .0278 | .0425 | .1270 | .0223 | .0202 | .0609 | .0697 |
J | K | L | M | N | O | P | Q | R |
.0015 | .0077 | .0402 | .0241 | .0675 | .0751 | .0193 | .0009 | .0599 |
S | T | U | V | W | X | Y | Z | |
.0633 | .0906 | .0276 | .0098 | .0236 | .0015 | .0197 | .0007 |
For example, the frequency of L is .0402=4.02%, which means that in average English text, 4.02% of the letters are L.
Call the goodness of a letter its value in the chart above. Then for our statistical method, define the goodness of a sentence to equal the sum of the goodness of each of its letters. For example the goodness of the string GOOD
is
goodness("GOOD
") = .0202 + .0751 + .0751 + .0425 = .2129
Now, the idea behind frequency analysis is that strings with higher goodness are more likely to represent English text. For example, if the spy sees the encoded message "JRRG
", this might either represent the original text "GOOD
" with shift S=3, or "IQQF
" with S=1. But the goodness of "IQQF
" is
goodness("IQQF
") = .0697 + .0009 + .0009 + .0223 = .0938
and since .0938 < .2129, your program should conclude that GOOD
is more likely to be the correct message than IQQF
.
Measuring strings by goodness is not perfect. Say the secret message was JAZZ and the secret shift value was S=10, so the encoded text was TKJJ . When you input TKJJ to your solver it will try S=10, giving JAZZ as a possibility with goodness .0846, but the best is S=5, giving OFEE with goodness of .3514 as the program's best guess for the secret message. Your program should go ahead and output the non-English word OFEE . (Frequency analysis of this type works better if the secret message is longer.) |