Python: how to get the length of data compressed with zlib?

I have a file containing multiple zlib-compressed binary streams whose offsets and lengths are unknown. The script below finds the offset of the byte just after the final zlib stream, which is what I need. The script works; however, to get the length of the original compressed data, I have to decompress it and then re-compress it. Is there a better way to get the length without re-compressing? Here's my code:

import zlib


def inflate(infile):
    data = infile.read()
    offset = 0
    while offset < len(data):
        window = data[offset : offset + 2]
        for key, value in zlib_headers.items():
            if window == key:
                decomp_obj = zlib.decompressobj()
                yield key, offset, decomp_obj.decompress(data[offset:])
        if offset == len(data):
            break
        offset += 1


if __name__ == "__main__":
    zlib_headers = {b"\x78\x01": 3, b"\x78\x9c": 6, b"\x78\xda": 9}

    with open("input_file", "rb") as infile:
        *_, last = inflate(infile)

    key, offset, data = last
    start_offset = offset + len(zlib.compress(data, zlib_headers[key]))

    print(start_offset)

1 answer

  • answered 2022-04-23 15:29 Mark Adler

    Recompressing it won't even work. The recompression could be a different length. There is no assurance that the result will be the same, unless you control the compression process that made the compressed data in the first place, and you can guarantee that it uses the same compression code, same version of that code, and exactly the same settings. There is not even enough information in the zlib header to determine what the compression level was.

    By the way, your list of possible zlib headers is incomplete. There are 29 others it could be. The easiest and most reliable way to determine whether or not a zlib stream starts at the current byte is to begin decompressing until you either get an error or it completes. The first thing the decompressor will do is check the zlib header for validity.
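To illustrate why a three-entry header table is incomplete, here is a minimal sketch that enumerates two-byte headers following the RFC 1950 layout. It assumes compression method 8 (deflate), window sizes from 256 bytes to 32 KiB, and no preset dictionary (FDICT clear); the function name is hypothetical. Under those assumptions it yields 32 headers, including the three from the question. A matching header is still only a hint: the bytes can also occur by chance, so attempting to decompress remains the real test.

```python
def zlib_headers_without_dictionary():
    """Enumerate CMF/FLG byte pairs that satisfy the RFC 1950 header check.

    Assumptions: CM = 8 (deflate), CINFO in 0..7, FDICT = 0.
    """
    headers = []
    for cinfo in range(8):            # window size is 2 ** (cinfo + 8)
        cmf = (cinfo << 4) | 8        # high nibble CINFO, low nibble CM
        for flevel in range(4):       # informational level hint, 2 bits
            flg = flevel << 6         # FDICT (bit 5) left clear
            # FCHECK makes the 16-bit value CMF*256 + FLG a multiple of 31.
            flg |= (31 - (cmf * 256 + flg) % 31) % 31
            headers.append(bytes([cmf, flg]))
    return headers


if __name__ == "__main__":
    for header in zlib_headers_without_dictionary():
        print(header.hex())
```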

    To find the length of the compressed data, feed decomp_obj.decompress() a fixed number of bytes at a time, e.g. 65536 bytes, and keep track of how many bytes you have fed it. Stop when decomp_obj.eof is true, which indicates that the end of the zlib stream has been reached. At that point decomp_obj.unused_data holds the bytes you fed it that came after the end of the zlib stream. Subtract the length of that leftover from the total amount fed, and you have the length of the zlib stream.
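The procedure described above might be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name `zlib_stream_length` and its parameters are hypothetical, and it assumes the whole file is already in memory as a bytes object. It returns None when decompression fails or the data runs out before the stream ends, so it can double as the "does a stream start here?" test.

```python
import zlib

CHUNK_SIZE = 65536  # bytes fed to the decompressor per call


def zlib_stream_length(data, offset=0, chunk_size=CHUNK_SIZE):
    """Return the compressed length of the zlib stream starting at `offset`,
    or None if no complete, valid stream starts there."""
    decomp_obj = zlib.decompressobj()
    fed = 0          # total number of bytes handed to the decompressor
    pos = offset
    while not decomp_obj.eof and pos < len(data):
        chunk = data[pos:pos + chunk_size]
        try:
            decomp_obj.decompress(chunk)  # decompressed output is discarded
        except zlib.error:
            return None                   # bad header or corrupt stream
        fed += len(chunk)
        pos += len(chunk)
    if not decomp_obj.eof:
        return None                       # ran out of data mid-stream
    # unused_data holds the bytes of the last chunk that lie past the stream.
    return fed - len(decomp_obj.unused_data)


if __name__ == "__main__":
    stream = zlib.compress(b"example payload" * 100)
    blob = b"\x00\x01" + stream + b"trailing bytes"
    length = zlib_stream_length(blob, 2)
    print(length == len(stream))
    print(2 + length)  # offset of the first byte after the stream
```

To locate the byte after the last stream in a file, one option is to call this at each offset, jump ahead by the returned length whenever it is not None, and otherwise advance by one byte.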
