Part 1: standalone activity

Prefix Codes and Huffman Decoding

This page develops the basic ideas we need for Huffman decoding. We will move from fixed-length codes to prefix-free tree paths, and then to decoding by tracing bits through a tree.

How to work through this page

  • Read the explanation for each idea.
  • Compare it with the worked example in that section.
  • Answer the practice questions directly below it.
  • Finish by checking that you can trace a bitstring through a tree.

Optional Video

The video is optional. Use it if another explanation would help before or after a section.

The timestamp links point to the parts that match the sections below.

0:00-5:31 optional

Compression and Ambiguity

The basic problem in lossless compression: a decoder must recover one well-defined original message.

14:18-16:20 optional

Frequent Symbols

Why shorter codes are most useful when they are assigned to symbols that appear often.

18:11-20:14 optional

Prefix-Free Trees

How paths in a binary tree define prefix-free codes and support left-to-right decoding.

20:15-22:37 optional

Weighted Cost

How code length and frequency combine to measure the total number of bits used.

22:38-27:45 preview

Building the Tree

A preview of the construction algorithm. This is useful context, but the first project provides the tree.

Key Terms

In what follows, we use these terms without relying on the video.

Symbol

A unit of data being encoded. In these examples this might be a letter; in the project it will be an RGB pixel color.

Code

The binary string assigned to a symbol. For instance, we might have A = 0 or blue = 11.

Leaf

A tree node that carries a symbol. Reaching a leaf tells the decoder that one complete symbol has been found.

1. Compression, Fixed-Length Codes, and Ambiguity

We begin with the fact that computers store data as bits. If every symbol uses the same number of bits, decoding is straightforward because the decoder can split the bitstring into equal-sized chunks. Compression becomes possible when common symbols use shorter bitstrings and rare symbols use longer bitstrings.

A fixed-length code assigns the same number of bits to every symbol. With four symbols, two bits are enough because the possible two-bit strings are 00, 01, 10, and 11.

SymbolCode
A00
B01
C10
D11

Example. To encode BAD, we replace each character with its code: B is 01, A is 00, and D is 11. The encoded bitstring is 010011.

Because each character costs 2 bits, this three-character message costs 3 x 2 = 6 bits.

Why variable-length codes can be ambiguous

A variable-length code is not automatically safe. If one complete code is the beginning of another complete code, the decoder may not know where to stop.

SymbolCode
A0
B01
C10

Notice that the code for A, 0, is the beginning of the code for B, 01. This creates ambiguity.

Practice. If one encoded bitstring can decode into two different messages, which requirement has failed?

Practice. With A=0, B=01, and C=10, the bitstring 010 is ambiguous.

2. Frequent Symbols Should Usually Have Shorter Codes

Once symbols occur at different rates, fixed-length codes can be wasteful. A shorter code is most useful when it belongs to a symbol that appears often.

Example. Suppose A appears 60 times and B appears 5 times. Saving one bit on A saves 60 bits total, while saving one bit on B saves only 5 bits total. For this reason, frequent symbols usually belong closer to the root.

Practice. Using frequencies A=50, B=25, C=15, and D=10, which codebook is better?

Codebook 1: A=0, B=10, C=110, D=111

Codebook 2: A=00, B=01, C=10, D=11

3. Prefix-Free Codes Come from Tree Paths

The natural question is how a decoder can safely use variable-length codes. The decoder needs a clean stopping rule, and a prefix-free codebook gives one.

We say that a codebook is prefix-free if no symbol's complete code is the beginning of another symbol's complete code. Prefix-free codes can be decoded from left to right without separators.

SymbolCodeLength
A01
B102
C1103
D1113

Example. Decode 0100 from left to right. The first 0 is A. The next bits 10 are B. The final 0 is A. Thus 0100 decodes to ABA.

4. Binary Trees as Codebooks

A binary tree gives a concrete way to define a prefix-free code. Starting at the root, each left edge appends 0 and each right edge appends 1. A symbol's code is the path from the root to its leaf.

Encoding tree for A, B, C, and D The root has A as its left child and an internal node as its right child. The internal node has B as its left child and another internal node as its right child. The final internal node has C as its left child and D as its right child. 0 1 0 1 0 1 root A node B node C D

Derived Code Table

SymbolPathCode
Aleft0
Bright, left10
Cright, right, left110
Dright, right, right111

Why the tree is prefix-free. Symbols only appear at leaves. If A is a leaf, the path to A stops there; no longer symbol can continue below A. This is why a leaf-based codebook avoids the ambiguity from Section 1.

5. Interactive Decoder Walk

The stepper shows the decoding algorithm in miniature. A 0 moves left, and a 1 moves right. When the walk reaches a leaf, one symbol has been decoded, and the walk returns to the root.

0 1 0 1 0 1 root red node green node blue gold

6. Frequency and Weighted Path Length

Code length is exactly tree depth. A symbol at depth 1 costs 1 bit each time it appears. A symbol at depth 3 costs 3 bits each time it appears.

Compression depends on how often each code is used. The cost contribution for one symbol is frequency x code length. The total weighted path length is the sum of these contributions.

Symbol Frequency Code Code Length Frequency x Length
A500150
B2510250
C15110345
D10111330

Total cost: 50 + 50 + 45 + 30 = 175 bits per 100 symbols.

Average cost: 175 / 100 = 1.75 bits per symbol.

Recalculate with new frequencies

Here the code lengths stay fixed as A=1, B=2, C=3, and D=3.

7. Lossy Binning Before Huffman Coding

Huffman coding is lossless for the symbols it receives. If the symbols are exact RGB pixels, then decoding should recover those exact RGB pixels.

From a practical point of view, image compression can also include a separate step before Huffman coding: nearby colors are placed in the same bin. This is lossy because the original colors are rounded. After that rounding, Huffman coding can still decode the binned image exactly.

Example. Dropping 4 low-order bits from each color channel keeps the rough color while removing small differences. The value 213 has binary form 11010101. Keeping the top four bits gives 1101, and padding with zeros gives 11010000, which is 208.

In this case, (213, 66, 73) becomes (208, 64, 64).

Lossless symbols

Small color differences remain distinct, so the tree needs more leaves.

  • (210, 64, 70)
  • (213, 66, 73)
  • (218, 71, 69)
  • (221, 78, 74)
  • (209, 70, 76)
  • (216, 69, 67)
  • (42, 120, 202)
  • (244, 202, 62)

Lossy-binned symbols

Nearby colors are represented by one rounded symbol, so the tree can be smaller.

  • (208, 64, 64) for six red-like pixels
  • (32, 112, 192) for one blue pixel
  • (240, 192, 48) for one gold pixel
Version What counts as a symbol Tree leaves Deepest leaf Average code length
LosslessExact RGB pixels833.00 bits per pixel
Lossy-binnedRounded RGB pixels321.25 bits per pixel

Notice that the smaller tree does not mean the Huffman step lost information. The information was simplified before Huffman coding began. The decoded lossy image should match the binned image exactly, not the original image exactly.

Practice. Which version is more likely to have fewer leaves in its Huffman tree?

8. Other References

Use these references for another explanation after working through the activity above.

9. Readiness Check

Before moving on to the project, make sure these statements feel true. If not, review the relevant section above.

Common mistake to avoid. Do not append a decoded symbol after every bit. Append only when the tree walk reaches a leaf, and then reset to the root.