Decode Huffman encoding in HTTP/2

I'm trying to understand how to do Huffman decoding of an HTTP/2 header. Most of the docs I see talk about having a binary tree for the frequency table, but for HTTP/2 it's just a static lookup table. I've got the encoding part working fine, but the decoding is confusing me as I don't know how to tell how many bits I'm supposed to take each time.

1 answer

  • answered 2018-01-13 20:04 Wumpus Q. Wumbley

    A Huffman code is a prefix-free code. This means that no encoded symbol is a prefix of any other encoded symbol.

    If there is a symbol represented by the bit string 00111, then there can't be one represented by 001110 or 001111 or 0011100110011 - nothing else will start with 00111. If you've read the bit string 00111, you don't need a marker to tell you that you're at the end of a symbol. If you can produce an output symbol from the bits you've read so far, you must produce that output symbol and start reading the next one.

    When you've read 0011, you won't be able to output anything because 0011 is not a symbol. It can't be, because it's a prefix of 00111.

    Huffman codes always assign some meaning to every possible bit string (except those that end prematurely). The fact that 0011 has no meaning by itself means that there must be at least 2 symbol codes starting with 0011. At least one will start with 00110 and at least one will start with 00111.
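    The prefix-free property is easy to verify mechanically. Here's a quick sketch in Python with a made-up code table (not the real HTTP/2 one):

```python
def is_prefix_free(codes):
    """Return True if no code is a proper prefix of another code."""
    values = sorted(codes.values())
    # After lexicographic sorting, if any code is a prefix of another,
    # it is also a prefix of its immediate neighbor, so checking
    # adjacent pairs is enough.
    return all(not b.startswith(a) for a, b in zip(values, values[1:]))

# Hypothetical table: "0011" is absent, so "00110" and "00111" can coexist.
good = {"A": "00110", "B": "00111", "C": "010", "D": "1"}
bad = {"A": "0011", "B": "00111"}  # "0011" is a prefix of "00111"
```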

    To decode an input bit stream, you start from a state representing no input, then read a bit, and move to the state representing the bits you've read so far. For example, if you're in state 00 and you read a 1, you move to state 001. When you reach a state that corresponds to a symbol, you output that symbol and move back to the initial state before reading the next bit.

    (Note that detecting the end of stream is outside the scope of Huffman coding. The containing protocol must tell you how to know when you're at the end of the bit stream. In HPACK, for example, the encoded string's length in octets is given up front, and any unused bits in the final octet are filled with the most significant bits of the EOS symbol's code, i.e. all 1s.)

    Since every state has exactly 2 possible successor states (corresponding to a 0 bit and a 1 bit), the state transitions form a binary tree. At every non-leaf node you read a bit to decide whether to go to the left child or the right child; at every leaf node you've finished decoding a symbol, so you output it and go back to the root.
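    The decode loop just described can be sketched like this, assuming you already have such a tree (the Node type here is a minimal hypothetical representation, not a required one):

```python
class Node:
    """Leaves carry a symbol; inner nodes carry left (0-bit)
    and right (1-bit) children."""
    def __init__(self, symbol=None):
        self.symbol = symbol
        self.left = None
        self.right = None

def decode(root, bits):
    """Decode a string of '0'/'1' characters into a list of symbols."""
    out, node = [], root
    for bit in bits:
        node = node.left if bit == "0" else node.right
        if node.symbol is not None:   # reached a leaf
            out.append(node.symbol)
            node = root               # back to the initial state
    return out
```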

    You can build the tree from a list of symbols and their encodings, and then traverse the tree to decode the input. Writing the code to build the tree is probably what will give you the experience to truly understand Huffman coding.

    When you have an input list like

    A 00
    B 010
    C 011
    D 100
    E 1010
    F 1011
    G 1100
    H 1101
    I 1110
    J 1111

    your tree structure should satisfy

    root->symbol is null
    root->left->symbol is null
    root->left->left->symbol = A
    root->left->right->left->symbol = B
    root->left->right->right->symbol = C

    (In this pseudocode, every node has 3 attributes, but in a real language, you will probably find a more efficient representation. Every node needs to have either a symbol or a pair of pointers/references to left and right child nodes.)
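    Building such a tree is a short loop over the code list. Here's a sketch in Python (the Node class and names are my own choices, not a required representation):

```python
class Node:
    def __init__(self):
        self.symbol = None  # set only on leaf nodes
        self.left = None    # child followed on a 0 bit
        self.right = None   # child followed on a 1 bit

def build_tree(codes):
    """codes maps each symbol to its bit string, e.g. {'A': '00', ...}."""
    root = Node()
    for symbol, bits in codes.items():
        node = root
        for bit in bits:
            if bit == "0":
                if node.left is None:
                    node.left = Node()
                node = node.left
            else:
                if node.right is None:
                    node.right = Node()
                node = node.right
        node.symbol = symbol  # the node reached by the full code is a leaf
    return root
```

    With the example list above, build_tree gives root.left.left.symbol == 'A', root.left.right.left.symbol == 'B', and so on, matching the structure shown.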

    Since you have a static code list, you only need to build the tree once and you can reuse it for the lifetime of the program.