*Exercises 15A, 15B, and 15C can be completed in any order.*

Cryptography is the art and science of hiding the meaning of information, in a way that only some people can see it. In this lesson we will introduce one of the simplest cryptographic methods, the **Caesar cipher **(also known as a **shift cipher**), and you will to write a program to break it. You will design every aspect of your solution (unlike lesson 15A, which we broke into sub-parts).

The Caesar cipher works by replacing each letter of the alphabet by another letter. To be precise, whenever you want to encode some text you need to pick a **shift value*** S*, which is a number between 0 and 25. Then, you replace each letter in the text with the letter which is *S* positions later in the alphabet, circling around to the start after you reach Z at the end of the alphabet.

## Example

Suppose we want to encode the secret message

`JOIN ME AT EIGHT BY THE ZOO`

using the shift value *S*=2. The encryption rule says each letter is replaced by the one which occurs 2 positions later in the alphabet. For example, since the alphabet is ABCDEFGHI**J**K**L**..., the first letter `J`

will be replaced by the letter `L`

. Continuing, the `O`

is replaced by `Q`

, the `I`

is replaced by `K`

, et cetera. To encode the letter Y, we have to circle back to the start: after Y comes Z, then A, so `Y`

is replaced by `A`

. Likewise `Z`

is replaced by `B`

. So the encoded version of our secret message is:

`LQKP OG CV GKIJV DA VJG BQQ`

If some spy saw this message, it would not be obvious to them what your message was about.

`SPY CODER`

with the shift value *S*=5? (Use uppercase.) You can write a program in the console to compute this, if you like — it might be useful later.

## Decoding

Once your friend gets the message, if they know the secret shift value *S* then it is easy for them to decipher the message: each letter is replaced by the one appearing *S* places *earlier* in the alphabet. For example, they would look at `L`

, step back two positions in the alphabet and find `J`

, which they now know is the first letter of your secret message. Again, they have to treat the alphabet as cyclic, assuming that `Z`

comes before `A`

.

`HUD`

, and the shift value is *S*=6, what was the original message? (Use uppercase.)

## Working for the Bad Guys

You have been hired by a spy to help decipher a message encrypted using a shift cipher. Sadly, the spy does not know the value of *S*. You will use statistics to write a program that has a good chance to automatically find the right value of *S*.

Our method will be to use **letter frequency analysis**. In English, some letters are very common (like E) and some are very rare (like J). Here is a table of frequencies, based on counting the occurrences in a large body of text.

A | B | C | D | E | F | G | H | I |
---|---|---|---|---|---|---|---|---|

.0817 | .0149 | .0278 | .0425 | .1270 | .0223 | .0202 | .0609 | .0697 |

J | K | L | M | N | O | P | Q | R |

.0015 | .0077 | .0402 | .0241 | .0675 | .0751 | .0193 | .0009 | .0599 |

S | T | U | V | W | X | Y | Z | |

.0633 | .0906 | .0276 | .0098 | .0236 | .0015 | .0197 | .0007 |

For example, the frequency of L is .0402=4.02%, which means that in average English text, 4.02% of the letters are L.

Call the **goodness** of a letter its value in the chart above. Then for our statistical method, define the **goodness** of a sentence to equal the sum of the goodness of each of its letters. For example the goodness of the string `GOOD`

is

goodness("`GOOD`

") = .0202 + .0751 + .0751 + .0425 = .2129

Now, the idea behind frequency analysis is that strings with higher goodness are more likely to represent English text. For example, if the spy sees the encoded message "`JRRG`

", this might either represent the original text "`GOOD`

" with shift *S*=3, or "`IQQF`

" with *S*=1. But the goodness of "`IQQF`

" is

goodness("`IQQF`

") = .0697 + .0009 + .0009 + .0223 = .0938

and since .0938 < .2129, your program should conclude that `GOOD`

is more likely to be the correct message than `IQQF`

.

Measuring strings by goodness is not perfect. Say the secret message was `JAZZ` and the secret shift value was S=10, so the encoded text was `TKJJ` . When you input `TKJJ` to your solver it will try S=10, giving `JAZZ` as a possibility with goodness .0846, but the best is S=5, giving `OFEE` with goodness of .3514 as the program's best guess for the secret message. Your program should go ahead and output the non-English word `OFEE` . (Frequency analysis of this type works better if the secret message is longer.) |

The solution above is called a brute force solution since you just tried all 26 possibilities. You could have even done this by hand if you needed to. Modern cryptographic protocols, on the other hand, are designed so that brute force solutions fail, even if you use very fast computers to try all possibilities. |

If you wanted to make this auto-decoding system more accurate, you could pay attention to other English statistics, such as "what letters are likely to occur in pairs next to each other? what letters are likely to start a word?" This is also useful for more general substitution ciphers, where letters replace letters but not in a cyclic way. For such ciphers brute force does not work since there are 26*25*...*1 possible substitution ciphers, which equals approximately 4 * 10^{26} (four hundred septillion).

You have completed this lesson! Feel free to send your friends an encrypted message to celebrate.