Character Frequency

Analyze character frequency in text with bar chart visualization

What is it and how does it work?

Character frequency analysis counts how often each character appears in a text and reports the results as counts and percentages. It is one of the oldest techniques in cryptanalysis: the Arab mathematician Al-Kindi described frequency analysis in the 9th century, and it has been the standard method for breaking monoalphabetic substitution ciphers for more than 1,000 years. In English text, the approximate letter frequencies are well established: E (~12.7%), T (~9.1%), A (~8.2%), O (~7.5%), I (~7.0%), N (~6.7%), S (~6.3%), H (~6.1%), R (~6.0%).
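As a rough illustration of the counting step, here is a minimal Python sketch. The function name char_frequencies and the letters_only flag are hypothetical choices made for this example, not part of the tool; it folds case and ignores non-letters by default, which is one common convention rather than the only one.

```python
from collections import Counter

def char_frequencies(text, letters_only=True):
    """Return (char, count, percent) tuples sorted by descending count."""
    if letters_only:
        # Fold case and drop digits, spaces, and punctuation
        chars = [c.upper() for c in text if c.isalpha()]
    else:
        chars = list(text)
    counts = Counter(chars)
    total = sum(counts.values())
    # Sort by count (descending), then alphabetically so ties are stable
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(ch, n, 100.0 * n / total) for ch, n in ranked]

# Print a small text-mode bar chart, one '#' per occurrence
for ch, n, pct in char_frequencies("Hello, World!"):
    print(f"{ch} {n:2d} {pct:5.1f}% {'#' * n}")
```

For "Hello, World!" the top row is L with 3 of 10 letters (30.0%), followed by O with 2. Passing letters_only=False would count spaces and punctuation as well.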

Beyond cryptography, character frequency analysis is used in data compression (Huffman coding assigns shorter codes to more frequent characters), natural language processing (language identification from character n-grams), text normalization (detecting unusual character distributions that signal encoding problems or non-standard text), and typographic analysis.

Frequently asked questions

What is the ETAOIN SHRDLU order?

ETAOIN SHRDLU is the approximate order of the 12 most common letters in English (E, T, A, O, I, N, S, H, R, D, L, U). It became famous because Linotype keyboards arranged their keys by letter frequency, so these letters formed the first two columns. When an operator made a mistake, a common habit was to run a finger down those columns to fill out the rest of the line (the "slug") so that it could be discarded. Occasionally a faulty slug was overlooked, and "etaoin shrdlu" ended up in print. Actual frequencies vary by corpus (novels vs. technical writing vs. news).

How does frequency analysis break a substitution cipher?

In a simple substitution cipher, each plaintext letter always maps to the same ciphertext letter. Frequency analysis exploits the fact that this mapping preserves letter frequencies: in a sufficiently long message, the most common ciphertext letter most likely stands for the most common plaintext letter (E in English). Starting points: (1) most common ciphertext letter → E; (2) most common 2-letter combo → TH or HE; (3) most common 3-letter combo → THE. Build from there, using word patterns and known digraph/trigraph frequencies to confirm or revise each guess. Short messages may deviate noticeably from the expected frequencies, so treat early guesses as hypotheses.
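The first starting point can be sketched in Python as follows. This is only a first-pass guess under stated assumptions: ENGLISH_ORDER is one commonly quoted approximate ranking (it varies by corpus), and the names first_guess_key and apply_key are hypothetical helpers for this example. On real ciphertext, most of the mapping beyond the top few letters will need manual correction.

```python
from collections import Counter

# Approximate English letter order, most to least frequent (varies by corpus)
ENGLISH_ORDER = "ETAOINSHRDLCUMWFGYPBVKJXQZ"

def first_guess_key(ciphertext):
    """Map each ciphertext letter to a plaintext guess by frequency rank."""
    counts = Counter(
        c for c in ciphertext.upper() if c.isascii() and c.isalpha()
    )
    # Most frequent first; ties broken alphabetically for repeatability
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return dict(zip(ranked, ENGLISH_ORDER))

def apply_key(ciphertext, key):
    """Substitute letters found in the key; leave everything else unchanged."""
    return "".join(key.get(c, c) for c in ciphertext.upper())

key = first_guess_key("XXXYYZ")
print(key)                       # {'X': 'E', 'Y': 'T', 'Z': 'A'}
print(apply_key("XXXYYZ", key))  # EEETTA
```

From here, a solver would typically check the guess against common digraphs and trigraphs and swap letters in the key until readable words emerge.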

What is Zipf's Law and how does it relate to character frequency?

Zipf's Law states that in natural language, the frequency of a word is roughly inversely proportional to its rank in the frequency table: the most common word appears ~2× as often as the 2nd most common, ~3× as often as the 3rd, and so on. This power-law pattern is usually stated for words, not single characters; with only a few dozen symbols, character frequencies are spread more evenly. Character bigram (2-character) frequencies, however, are heavily skewed in a Zipf-like way: "TH" is far more common than "XQ".
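One way to explore this on your own text is to rank bigrams and look at rank × count, which stays roughly constant when a distribution is close to Zipf's Law. The sketch below is illustrative only; bigram_counts and rank_times_frequency are hypothetical names, and it assumes ASCII letters with bigrams counted inside words (not across spaces).

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count adjacent letter pairs within words (no pairs across spaces)."""
    counts = Counter()
    for word in re.findall(r"[A-Za-z]+", text):
        word = word.upper()
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts

def rank_times_frequency(counts):
    """Return (rank, item, count, rank * count) rows, most frequent first."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(r, item, n, r * n) for r, (item, n) in enumerate(ranked, start=1)]

for row in rank_times_frequency(bigram_counts("the then")):
    print(row)
```

A two-word sample is far too small to show any trend; a text of at least several thousand words is needed before the rank × count column becomes meaningful.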

What is Huffman coding and how does character frequency inform it?

Huffman coding is a lossless data compression algorithm that assigns shorter binary codes to more frequent characters. Algorithm: (1) count character frequencies; (2) build a priority queue (min-heap) with one leaf node per character, keyed by frequency; (3) repeatedly remove the two lowest-frequency nodes and combine them into a parent node whose frequency is their sum, until a single tree remains; (4) assign 0/1 to left/right branches, so each character's code is the path from the root to its leaf. As an illustration, "E" might be encoded as "01" (2 bits) while "Q" is encoded as "1110100" (7 bits); the exact codes depend on the frequencies in the input. Used in deflate (ZIP, gzip, PNG), JPEG, MP3.
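The four steps above can be sketched in Python with the standard heapq module. This is a minimal teaching version, not how deflate or JPEG implement it (those formats add their own canonical-code and length-limit rules). The name huffman_codes is hypothetical, and the choice of "0" for a single-symbol input is an assumption made here so that every character gets a non-empty code.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text):
    """Return {char: bitstring} for the characters in text."""
    freqs = Counter(text)          # step 1: count character frequencies
    if not freqs:
        return {}
    if len(freqs) == 1:
        # Assumption: a lone symbol still gets a one-bit code
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    # Step 2: min-heap of (frequency, tiebreak, tree);
    # a tree is either a single char (leaf) or a (left, right) tuple
    heap = [(n, next(tiebreak), ch) for ch, n in freqs.items()]
    heapq.heapify(heap)
    # Step 3: merge the two lowest-frequency nodes until one tree remains
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tiebreak), (left, right)))
    # Step 4: walk the tree, appending 0 for left and 1 for right
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("aaaabbc")
print(codes)
print(sum(len(codes[c]) for c in "aaaabbc"), "bits")
```

For "aaaabbc" the most frequent character, "a", gets a 1-bit code and the other two get 2-bit codes, for 10 bits in total versus 56 bits at 8 bits per character. Different tie-breaking rules can produce different but equally short code tables.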
