Question 1

What is the ETAOIN SHRDLU order?

Accepted Answer

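ETAOIN SHRDLU is the approximate order of the 12 most common letters in English (E, T, A, O, I, N, S, H, R, D, L, U). It became famous through Linotype machines, whose keyboards were arranged by letter frequency: the first two columns of keys read "etaoin" and "shrdlu". When an operator made a mistake, a common practice was to run a finger down those columns to fill out the faulty line (the "slug") so it could be spotted and discarded. Occasionally such a line was overlooked, and the phrase "etaoin shrdlu" appeared in print. Actual frequencies vary by corpus (novels vs. technical writing vs. news), so the exact order varies too.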
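To see how such an order is derived from a corpus, here is a minimal sketch that ranks the letters of a text by count. The function name `letter_order` and the sample sentence are illustrative assumptions; a short sample will not reproduce ETAOIN SHRDLU, and only a large English corpus comes close.

```python
from collections import Counter

def letter_order(text: str) -> str:
    """Return the letters A-Z that occur in `text`, most frequent first."""
    # Count only ASCII letters, ignoring case, digits, spaces and punctuation.
    counts = Counter(ch for ch in text.upper() if "A" <= ch <= "Z")
    # Sort by descending count; break ties alphabetically so the result is stable.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "".join(ranked)

# Hypothetical sample text; substitute any corpus you want to measure.
sample = "the quick brown fox jumps over the lazy dog and then the cat"
print(letter_order(sample)[:6])
```

Running it on different kinds of text (fiction, source code, news) is a quick way to observe how much the order depends on the corpus.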

Question 2

How does frequency analysis break a substitution cipher?

Accepted Answer

In a simple substitution cipher, each plaintext letter always maps to the same ciphertext letter. Frequency analysis exploits the fact that this mapping preserves letter frequencies: in a sufficiently long ciphertext, the most common ciphertext letter probably stands for the most common plaintext letter (E in English). Starting points: (1) most common ciphertext letter → E; (2) most common 2-letter combo → TH or HE; (3) most common 3-letter combo → THE. Build from there, using word patterns and known digraph/trigraph frequencies to confirm or revise each guess. Short ciphertexts can deviate from typical frequencies, so treat every mapping as a hypothesis until the decrypted text reads as English.
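The first step can be sketched as follows: rank the ciphertext letters by count and pair them with an assumed English frequency order. The names `ENGLISH_ORDER`, `first_guess_key` and `apply_key` are hypothetical, and the order beyond the first 12 letters is an approximation that varies by corpus. The result is only a starting guess to be refined by hand with digraphs and word patterns.

```python
from collections import Counter

# Assumed approximate English letter order, most frequent first.
# Only the first 12 letters come from "ETAOIN SHRDLU"; the rest is a rough guess.
ENGLISH_ORDER = "ETAOINSHRDLUCMWFGYPBVKJXQZ"

def first_guess_key(ciphertext: str) -> dict[str, str]:
    """Map each ciphertext letter to a plaintext guess by frequency rank."""
    counts = Counter(ch for ch in ciphertext.upper() if "A" <= ch <= "Z")
    # Most frequent ciphertext letter first; ties broken alphabetically.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return dict(zip(ranked, ENGLISH_ORDER))

def apply_key(ciphertext: str, key: dict[str, str]) -> str:
    """Substitute known letters; leave everything else unchanged."""
    return "".join(key.get(ch, ch) for ch in ciphertext.upper())

key = first_guess_key("XXXYYZ")
print(apply_key("XYZ!", key))
```

On real ciphertext the initial output is usually only partly readable; the point is that a few correct letters (often E and T) expose patterns such as THE that let you correct the rest.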

Question 3

What is Zipf's Law and how does it relate to character frequency?

Accepted Answer

Zipf's Law states that in natural language, the frequency of a word is roughly inversely proportional to its rank in the frequency table: the most common word appears ~2× as often as the 2nd most common, ~3× as often as the 3rd, etc. This power-law distribution describes words, not single characters: character frequency is more uniform at the top ranks (E is nowhere near twice as common as T). However, character bigram (2-char) frequencies are skewed in a Zipf-like way: "TH" is far more common than "XQ".
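As a rough illustration of the rank/frequency relationship, the sketch below computes the counts an ideal Zipf distribution (exponent 1) would predict, and tabulates rank × count for observed words, which stays roughly constant when the law holds. The function names `zipf_expected` and `rank_frequency` are assumptions made for this example.

```python
from collections import Counter

def zipf_expected(total: int, n_ranks: int) -> list[float]:
    """Expected counts for ranks 1..n_ranks under an ideal Zipf law (exponent 1)."""
    # Normalise by the harmonic number so the expected counts sum to `total`.
    harmonic = sum(1 / r for r in range(1, n_ranks + 1))
    return [total / (r * harmonic) for r in range(1, n_ranks + 1)]

def rank_frequency(words: list[str]) -> list[tuple[int, str, int, int]]:
    """Return (rank, word, count, rank * count) rows, most frequent word first."""
    counts = Counter(words)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, n, rank * n) for rank, (word, n) in enumerate(ranked, 1)]

expected = zipf_expected(total=1100, n_ranks=3)
print([round(x) for x in expected])
for row in rank_frequency("the cat and the dog and the bird".split()):
    print(row)
```

The same `rank_frequency` helper can be fed single characters or bigrams instead of words to compare how closely each unit follows the law.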

Question 4

What is Huffman coding and how does character frequency inform it?

Accepted Answer

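Huffman coding is a lossless data compression algorithm that assigns shorter binary codes to more frequent characters. Algorithm: (1) count character frequencies; (2) put each character into a priority queue (min-heap) as a leaf node keyed by its frequency; (3) repeatedly remove the two lowest-frequency nodes and combine them into a parent node whose frequency is their sum, until a single root remains; (4) assign 0/1 to left/right branches, so that each character's code is the path from the root to its leaf. Because characters sit only at the leaves, no code is a prefix of another and the bit stream decodes unambiguously. Result: a frequent character such as "E" might get a short code like "01" (2 bits), while a rare one such as "Q" gets a long code like "1110100" (7 bits); the exact codes depend on the frequencies in the input. Used in DEFLATE (ZIP, gzip, PNG), JPEG, and MP3.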
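The four steps can be sketched in Python with the standard `heapq` module. This is a minimal illustration rather than how any particular compressor implements it: instead of building explicit tree nodes, each heap entry carries a dictionary of partial codes, and the function name `huffman_codes` is an assumption.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Return a prefix-free bit string for each character in `text`."""
    counts = Counter(text)                      # step 1: count frequencies
    if not counts:
        return {}
    if len(counts) == 1:                        # a single symbol still needs one bit
        return {next(iter(counts)): "0"}
    # Step 2: min-heap of (frequency, tie_breaker, {char: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Step 3: merge the two lowest-frequency subtrees.
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Step 4: prepend 0 for the left branch and 1 for the right branch.
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (n1 + n2, next_id, merged))
        next_id += 1
    return heap[0][2]

text = "this is an example of huffman coding"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(len(encoded), "bits instead of", 8 * len(text))
```

Real formats such as DEFLATE add further constraints (for example, limits on code length and a canonical ordering of codes), which this sketch does not attempt to reproduce.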
