Part 1: Prefix Codes and Huffman Decoding

Optional Video

The video is optional. Use it if another explanation would help before or after a section.

The timestamp links point to the parts that match the sections below.

0:00-5:31 optional

Compression and Ambiguity

The basic problem in lossless compression: a decoder must recover one well-defined original message.

14:18-16:20 optional

Frequent Symbols

Why shorter codes are most useful when they are assigned to symbols that appear often.

18:11-20:14 optional

Prefix-Free Trees

How paths in a binary tree define prefix-free codes and support left-to-right decoding.

20:15-22:37 optional

Weighted Cost

How code length and frequency combine to measure the total number of bits used.

22:38-27:45 preview

Building the Tree

A preview of the construction algorithm. This is useful context, but the first project provides the tree.

Key Terms

In what follows, we use these terms without relying on the video.

Symbol

A unit of data being encoded. In these examples this might be a letter; in the project it will be an RGB pixel color.

Code

The binary string assigned to a symbol. For instance, we might have A = 0 or blue = 11.

Leaf

A tree node that carries a symbol. Reaching a leaf tells the decoder that one complete symbol has been found.

1. Compression, Fixed-Length Codes, and Ambiguity

We begin with the fact that computers store data as bits. If every symbol uses the same number of bits, decoding is straightforward because the decoder can split the bitstring into equal-sized chunks. Compression becomes possible when common symbols use shorter bitstrings and rare symbols use longer bitstrings.

A fixed-length code assigns the same number of bits to every symbol. With four symbols, two bits are enough because the possible two-bit strings are 00, 01, 10, and 11.

Symbol	Code
A	`00`
B	`01`
C	`10`
D	`11`

Example. To encode BAD, we replace each character with its code: B is 01, A is 00, and D is 11. The encoded bitstring is 010011.

Because each character costs 2 bits, this three-character message costs 3 x 2 = 6 bits.
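The encoding and chunk-based decoding described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the project code; the names FIXED_CODEBOOK, encode, and decode_fixed are hypothetical.

```python
# Fixed-length codebook from the table above: every symbol costs 2 bits.
FIXED_CODEBOOK = {"A": "00", "B": "01", "C": "10", "D": "11"}

def encode(message, codebook):
    """Replace each symbol with its code and join the results."""
    return "".join(codebook[symbol] for symbol in message)

def decode_fixed(bits, codebook, width=2):
    """Split the bitstring into equal-sized chunks and look each one up."""
    reverse = {code: symbol for symbol, code in codebook.items()}
    chunks = [bits[i:i + width] for i in range(0, len(bits), width)]
    return "".join(reverse[chunk] for chunk in chunks)

encoded = encode("BAD", FIXED_CODEBOOK)
print(encoded)                                # 010011
print(len(encoded))                           # 6
print(decode_fixed(encoded, FIXED_CODEBOOK))  # BAD
```

The decoder never has to guess where a symbol ends, because every chunk has the same width.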

Practice. With A=00, B=01, C=10, and D=11, encode ABCD.

Why variable-length codes can be ambiguous

A variable-length code is not automatically safe. If one complete code is the beginning of another complete code, the decoder may not know where to stop.

Symbol	Code
A	`0`
B	`01`
C	`10`

Notice that the code for A, 0, is the beginning of the code for B, 01. This creates ambiguity.
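To see the ambiguity concretely, one option is to list every way a bitstring can be split into complete codes. The sketch below does this by brute force; the function name all_decodings and the sample bitstring 0010 are illustrative choices, not part of the activity.

```python
# The ambiguous codebook from the table above.
AMBIGUOUS_CODEBOOK = {"A": "0", "B": "01", "C": "10"}

def all_decodings(bits, codebook):
    """Return every message whose encoding is exactly `bits`."""
    if bits == "":
        return [""]  # exactly one way to decode nothing: the empty message
    results = []
    for symbol, code in codebook.items():
        if bits.startswith(code):
            # Commit to this symbol, then decode the remainder every possible way.
            for rest in all_decodings(bits[len(code):], codebook):
                results.append(symbol + rest)
    return results

print(all_decodings("0010", AMBIGUOUS_CODEBOOK))  # ['AAC', 'ABA']
```

Two different messages produce the same bits, so a decoder that sees only 0010 cannot tell which one was sent.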

Practice. If one encoded bitstring can decode into two different messages, which requirement has failed?

The code is not fixed length.
The code is not uniquely decodable.
The code is not binary.

Practice. With A=0, B=01, and C=10, the bitstring 010 is ambiguous. Find two different messages that encode to 010.

2. Frequent Symbols Should Usually Have Shorter Codes

Once symbols occur at different rates, fixed-length codes can be wasteful. A shorter code is most useful when it belongs to a symbol that appears often.

Example. Suppose A appears 60 times and B appears 5 times. Saving one bit on A saves 60 bits total, while saving one bit on B saves only 5 bits total. For this reason, frequent symbols usually belong closer to the root of the code tree, where paths (and therefore codes) are shorter.

Practice. If symbol A appears 60 times and symbol B appears 5 times, which symbol should usually receive the shorter code?

Practice. Using frequencies A=50, B=25, C=15, and D=10, which codebook is better?

Codebook 1: A=0, B=10, C=110, D=111

Codebook 2: A=00, B=01, C=10, D=11

Codebook 1, because A is most frequent and has the shortest code.
Codebook 2, because equal-length codes are always better.
They are the same because both use binary strings.

3. Prefix-Free Codes Come from Tree Paths

The natural question is how a decoder can safely use variable-length codes. The decoder needs a clean stopping rule, and a prefix-free codebook gives one.

We say that a codebook is prefix-free if no symbol's complete code is the beginning of another symbol's complete code. Prefix-free codes can be decoded from left to right without separators.

Symbol	Code	Length
A	`0`	1
B	`10`	2
C	`110`	3
D	`111`	3
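The prefix-free definition can be checked mechanically by comparing every pair of codes. The following is a small sketch under that reading of the definition; the helper name is_prefix_free is hypothetical.

```python
def is_prefix_free(codebook):
    """Return True if no code is the beginning of a different symbol's code."""
    codes = list(codebook.values())
    for i, first in enumerate(codes):
        for j, second in enumerate(codes):
            # Skip comparing a code with itself; any other match is a violation.
            if i != j and second.startswith(first):
                return False
    return True

print(is_prefix_free({"A": "0", "B": "10", "C": "110", "D": "111"}))  # True
print(is_prefix_free({"A": "0", "B": "01", "C": "10"}))               # False
```

The second codebook fails because 0 (the code for A) is the beginning of 01 (the code for B).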

Example. Decode 0100 from left to right. The first 0 is A. The next bits 10 are B. The final 0 is A. Thus 0100 decodes to ABA.
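The left-to-right procedure in this example can be sketched with a small buffer: collect bits until they match a complete code, emit that symbol, and start over. This is one possible illustration that assumes the codebook is prefix-free; the names below are hypothetical.

```python
PREFIX_FREE_CODEBOOK = {"A": "0", "B": "10", "C": "110", "D": "111"}

def decode_prefix_free(bits, codebook):
    """Decode left to right, emitting a symbol as soon as a full code is seen."""
    reverse = {code: symbol for symbol, code in codebook.items()}
    decoded = []
    buffer = ""
    for bit in bits:
        buffer += bit
        if buffer in reverse:
            # Safe to stop here: in a prefix-free codebook, no longer code
            # can begin with the bits collected so far.
            decoded.append(reverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError(f"leftover bits that match no code: {buffer!r}")
    return "".join(decoded)

print(decode_prefix_free("0100", PREFIX_FREE_CODEBOOK))  # ABA
```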

Practice. Using A=0, B=10, C=110, and D=111, decode 0110111100.

4. Binary Trees as Codebooks

A binary tree gives a concrete way to define a prefix-free code. Starting at the root, each left edge appends 0 and each right edge appends 1. A symbol's code is the sequence of 0s and 1s read along the path from the root to its leaf.

Derived Code Table

Symbol	Path	Code
A	left	`0`
B	right, left	`10`
C	right, right, left	`110`
D	right, right, right	`111`

Why the tree is prefix-free. Symbols only appear at leaves. If A is a leaf, the path to A stops there; no other symbol's code can continue below A. This is why a leaf-based codebook avoids the ambiguity from Section 1.
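The derived code table can be reproduced by walking the tree and recording the path to each leaf. In the sketch below the tree is written as nested tuples, where a string is a leaf and a pair (left, right) is an internal node. That representation is an assumption made for illustration and is not necessarily how the project stores its tree.

```python
# Assumed representation: a leaf is a string, an internal node is (left, right).
TREE = ("A", ("B", ("C", "D")))

def codes_from_tree(node, path=""):
    """Return {symbol: code}, where left edges append 0 and right edges append 1."""
    if isinstance(node, str):
        return {node: path}  # reached a leaf: the path so far is its code
    left, right = node
    codes = {}
    codes.update(codes_from_tree(left, path + "0"))
    codes.update(codes_from_tree(right, path + "1"))
    return codes

print(codes_from_tree(TREE))
# {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
```

Because a code is only recorded at a leaf, no recorded code can be the beginning of another.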

Practice. Using red=0, green=10, blue=110, and gold=111, decode 010111.

5. Interactive Decoder Walk

The stepper shows the decoding algorithm in miniature. A 0 moves left, and a 1 moves right. When the walk reaches a leaf, one symbol has been decoded, and the walk returns to the root.
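The same walk can be written as a short loop. This is a minimal sketch, assuming the tree is stored as nested tuples with string leaves; the representation and the names TREE and decode_with_tree are hypothetical.

```python
# Assumed representation: a leaf is a string, an internal node is (left, right).
TREE = ("A", ("B", ("C", "D")))

def decode_with_tree(bits, root):
    """Walk the tree one bit at a time, emitting a symbol at each leaf."""
    decoded = []
    node = root
    for bit in bits:
        if bit not in "01":
            raise ValueError(f"not a bit: {bit!r}")
        left, right = node
        node = left if bit == "0" else right  # 0 moves left, 1 moves right
        if isinstance(node, str):
            decoded.append(node)  # append only when a leaf is reached
            node = root           # then return to the root
    if node is not root:
        raise ValueError("bitstring ended in the middle of a code")
    return decoded

print(decode_with_tree("100110", TREE))  # ['B', 'A', 'C']
```

Note that the symbol is appended inside the leaf check, not once per bit.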

6. Frequency and Weighted Path Length

A symbol's code length equals the depth of its leaf in the tree. A symbol at depth 1 costs 1 bit each time it appears. A symbol at depth 3 costs 3 bits each time it appears.

Compression depends on how often each code is used. The cost contribution for one symbol is frequency x code length. The total weighted path length is the sum of these contributions.

Symbol	Frequency	Code	Code Length	Frequency x Length
A	50	`0`	1	50
B	25	`10`	2	50
C	15	`110`	3	45
D	10	`111`	3	30

Total cost: 50 + 50 + 45 + 30 = 175 bits per 100 symbols.

Average cost: 175 / 100 = 1.75 bits per symbol.
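The table's arithmetic can be reproduced directly from the frequencies and the codebook. The sketch below is illustrative, and the names FREQUENCIES, CODEBOOK, and weighted_cost are hypothetical.

```python
FREQUENCIES = {"A": 50, "B": 25, "C": 15, "D": 10}
CODEBOOK = {"A": "0", "B": "10", "C": "110", "D": "111"}

def weighted_cost(frequencies, codebook):
    """Sum of frequency x code length over all symbols."""
    return sum(freq * len(codebook[symbol]) for symbol, freq in frequencies.items())

total = weighted_cost(FREQUENCIES, CODEBOOK)
average = total / sum(FREQUENCIES.values())
print(total)    # 175
print(average)  # 1.75
```

Holding the code lengths fixed and reversing the frequencies to A=10, B=15, C=25, D=50 gives 10 + 30 + 75 + 150 = 265 bits, or 2.65 bits per symbol, which shows how costly it is to give long codes to frequent symbols.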

Practice. If X appears 8 times at depth 1 and Y appears 2 times at depth 3, what is the total weighted cost?

Recalculate with new frequencies

Here the code lengths stay fixed as A=1, B=2, C=3, and D=3.

With the frequencies from the table above (A=50, B=25, C=15, and D=10), the total cost is 175 bits and the average cost is 1.75 bits per symbol.

7. Lossy Binning Before Huffman Coding

Huffman coding is lossless for the symbols it receives. If the symbols are exact RGB pixels, then decoding should recover those exact RGB pixels.

In practice, image compression can also include a separate step before Huffman coding: nearby colors are placed in the same bin. This step is lossy because the original colors are rounded. After that rounding, Huffman coding can still decode the binned image exactly.

Example. Dropping 4 low-order bits from each color channel keeps the rough color while removing small differences. The value 213 has binary form 11010101. Keeping the top four bits gives 1101, and padding with zeros gives 11010000, which is 208.

In this case, (213, 66, 73) becomes (208, 64, 64).
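Dropping low-order bits can be expressed with two bit shifts: shift right to discard the bits, then shift left to pad with zeros. The following sketch shows one way to do this; the function names bin_channel and bin_pixel are hypothetical.

```python
def bin_channel(value, dropped_bits=4):
    """Zero out the lowest `dropped_bits` bits of one 8-bit channel value."""
    return (value >> dropped_bits) << dropped_bits

def bin_pixel(pixel, dropped_bits=4):
    """Apply the same rounding to each channel of an (R, G, B) tuple."""
    return tuple(bin_channel(channel, dropped_bits) for channel in pixel)

print(format(213, "08b"))               # 11010101
print(format(bin_channel(213), "08b"))  # 11010000
print(bin_pixel((213, 66, 73)))         # (208, 64, 64)
```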

Lossless symbols

Small color differences remain distinct, so the tree needs more leaves.

(210, 64, 70)
(213, 66, 73)
(218, 71, 69)
(221, 78, 74)
(209, 70, 76)
(216, 69, 67)
(42, 120, 202)
(244, 202, 62)

Lossy-binned symbols

Nearby colors are represented by one rounded symbol, so the tree can be smaller.

(208, 64, 64) for six red-like pixels
(32, 112, 192) for one blue pixel
(240, 192, 48) for one gold pixel

Version	What counts as a symbol	Tree leaves	Deepest leaf	Average code length
Lossless	Exact RGB pixels	8	3	3.00 bits per pixel
Lossy-binned	Rounded RGB pixels	3	2	1.25 bits per pixel

Notice that the smaller tree does not mean the Huffman step lost information. The information was simplified before Huffman coding began. The decoded lossy image should match the binned image exactly, not the original image exactly.
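The comparison can be checked by counting distinct symbols before and after binning the eight pixels listed above. In this sketch, the leaf depths for the binned tree (1 for the red-like bin, 2 for the other two) are an assumption chosen to be consistent with the table, and all names are hypothetical.

```python
from collections import Counter

PIXELS = [
    (210, 64, 70), (213, 66, 73), (218, 71, 69), (221, 78, 74),
    (209, 70, 76), (216, 69, 67), (42, 120, 202), (244, 202, 62),
]

def bin_pixel(pixel, dropped_bits=4):
    """Drop the low-order bits of each channel (the lossy step)."""
    return tuple((channel >> dropped_bits) << dropped_bits for channel in pixel)

exact_counts = Counter(PIXELS)
binned_counts = Counter(bin_pixel(pixel) for pixel in PIXELS)

print(len(exact_counts))             # 8 distinct symbols, so 8 leaves
print(len(binned_counts))            # 3 distinct symbols, so 3 leaves
print(binned_counts[(208, 64, 64)])  # 6

# Assumed leaf depths for the binned tree, consistent with the table above.
BINNED_DEPTHS = {(208, 64, 64): 1, (32, 112, 192): 2, (240, 192, 48): 2}
total_bits = sum(count * BINNED_DEPTHS[symbol] for symbol, count in binned_counts.items())
print(total_bits / len(PIXELS))      # 1.25
```

The reduction in leaves comes entirely from the binning step; the counting and cost calculation treat whatever symbols they receive exactly.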

Practice. Drop 4 low-order bits from each channel in (213, 66, 73). What binned RGB value do you get?

Practice. Which version is more likely to have fewer leaves in its Huffman tree?

The lossless exact-color version.
The lossy-binned version.
They must always have the same number of leaves.

8. Other References

Use these references for another explanation after working through the activity above.

Full Reducible video. Watch the full version for the information-theory story.
How Huffman Trees Work - Computerphile. Watch for how paths through the tree become codes.
How Computers Compress Text: Huffman Coding and Huffman Trees - Tom Scott. This gives a short conceptual explanation.
Binary Trees, Part 1 - MIT OpenCourseWare. Review depth and height, then connect that to decoding cost.
CS Field Guide: Huffman Coding. This is a useful written reference.

9. Readiness Check

Before moving on to the project, make sure these statements feel true. If not, review the relevant section above.

I can explain why A=0 and B=01 creates ambiguity.
I can read codes from a tree using left=0 and right=1.
I can trace a bitstring from the root to leaves and reset after each leaf.
I can compute frequency x code length and average bits per symbol.
I can explain why binning nearby colors can reduce the number of leaves in a Huffman tree.
I understand that the project provides the tree and asks me to implement decoding.
I know that the project output is an image built from decoded RGB pixels.

Common mistake to avoid. Do not append a decoded symbol after every bit. Append only when the tree walk reaches a leaf, and then reset to the root.

One-sentence reflection: why does a Huffman decoder know when one symbol is complete?