Decode Huffman encoding in HTTP/2
I'm trying to understand how to do Huffman decoding of an HTTP/2 header. Most of the docs I see talk about having a binary tree for the frequency table, but for HTTP/2 it's just a static lookup table. I've got the encoding part working fine, but the decoding is confusing me as I don't know how to tell how many bits I'm supposed to take each time.
A Huffman code is a prefix-free code. This means that no encoded symbol is a prefix of any other encoded symbol.
If there is a symbol represented by the bit string
00111, then there can't be one represented by
0011100110011- nothing else will start with
00111. If you've read the bit string
00111, you don't need a marker to tell you that you're at the end of a symbol. If you can produce an output symbol from the bits you've read so far, you must produce that output symbol and start reading the next one.
When you've read
0011, you won't be able to output anything because
0011is not a symbol. It can't be, because it's a prefix of
Huffman codes always assign some meaning to every possible bit string (except one that ends prematurely). The fact that
0011has no meaning by itself means that there must be at least 2 symbol codes starting with
0011. At least one will start with
00110and at least one will start with
To decode an input bit stream, you start from a state representing no input, then read a bit, and move to the state representing the bits you've read so far. For example, if you're in state
00and you read a 1, you move to state
001. When you reach a state that corresponds to a symbol, you output that symbol and move back to the initial state before reading the next bit.
(Note that detecting the end of stream is outside the scope of Huffman coding. The containing protocol must tell you how to know when you're at the end of the bit stream.)
Since every state has exactly 2 possible successor states (corresponding to a
0bit and a
1bit), the state transitions form a binary tree. At every non-leaf node you're reading a bit to decide whether to go to the left child or the right child, and at every leaf node you've finished decoding a symbol, you output that symbol, and then go back to the root.
You can build the tree from a list of symbols and their encodings, and then traverse the tree to decode the input. Writing the code to build the tree will probably be the thing that gives you the experience to truly understand the Huffman code.
When you have an input list like
A 00 B 010 C 011 D 100 E 1010 F 1011 G 1100 H 1101 I 1110 J 1111
your tree structure should satisfy
root->symbol is null root->left->symbol is null root->left->left->symbol = A root->left->right->left->symbol = B root->left->right->right->symbol = C ...
(In this pseudocode, every node has 3 attributes, but in a real language, you will probably find a more efficient representation. Every node needs to have either a
symbolor a pair of pointers/references to
Since you have a static code list, you only need to build the tree once and you can reuse it for the lifetime of the program.