Decode Huffman encoding in HTTP/2
I'm trying to understand how to do Huffman decoding of an HTTP/2 header. Most of the docs I see talk about having a binary tree for the frequency table, but for HTTP/2 it's just a static lookup table. I've got the encoding part working fine, but the decoding is confusing me as I don't know how to tell how many bits I'm supposed to take each time.
1 answer

A Huffman code is a prefixfree code. This means that no encoded symbol is a prefix of any other encoded symbol.
If there is a symbol represented by the bit string
00111
, then there can't be one represented by001110
or001111
or0011100110011
 nothing else will start with00111
. If you've read the bit string00111
, you don't need a marker to tell you that you're at the end of a symbol. If you can produce an output symbol from the bits you've read so far, you must produce that output symbol and start reading the next one.When you've read
0011
, you won't be able to output anything because0011
is not a symbol. It can't be, because it's a prefix of00111
.Huffman codes always assign some meaning to every possible bit string (except one that ends prematurely). The fact that
0011
has no meaning by itself means that there must be at least 2 symbol codes starting with0011
. At least one will start with00110
and at least one will start with00111
.To decode an input bit stream, you start from a state representing no input, then read a bit, and move to the state representing the bits you've read so far. For example, if you're in state
00
and you read a 1, you move to state001
. When you reach a state that corresponds to a symbol, you output that symbol and move back to the initial state before reading the next bit.(Note that detecting the end of stream is outside the scope of Huffman coding. The containing protocol must tell you how to know when you're at the end of the bit stream.)
Since every state has exactly 2 possible successor states (corresponding to a
0
bit and a1
bit), the state transitions form a binary tree. At every nonleaf node you're reading a bit to decide whether to go to the left child or the right child, and at every leaf node you've finished decoding a symbol, you output that symbol, and then go back to the root.You can build the tree from a list of symbols and their encodings, and then traverse the tree to decode the input. Writing the code to build the tree will probably be the thing that gives you the experience to truly understand the Huffman code.
When you have an input list like
A 00 B 010 C 011 D 100 E 1010 F 1011 G 1100 H 1101 I 1110 J 1111
your tree structure should satisfy
root>symbol is null root>left>symbol is null root>left>left>symbol = A root>left>right>left>symbol = B root>left>right>right>symbol = C ...
(In this pseudocode, every node has 3 attributes, but in a real language, you will probably find a more efficient representation. Every node needs to have either a
symbol
or a pair of pointers/references toleft
andright
child nodes.)Since you have a static code list, you only need to build the tree once and you can reuse it for the lifetime of the program.