Optimizing LZ Match Size

August 10, 2025 · compression, rust

This is a short post about a recent optimization I made while adding new fast compression levels to the fdeflate crate. As a bit of background, fdeflate provides fast zlib compression and decompression as a component of the png crate.

Background

LZ compression works by finding matches in the input data and replacing them with references to previous occurrences. This article provides a good overview of how modern LZ compression works.

One detail of lossless compression algorithms that is often overlooked is that fast compressors rely on a whole range of ad-hoc heuristics. There's no such thing as "zlib level 3" beyond just the collection of heuristics that the zlib library decided to bundle up and call "level 3".

There actually are nearly optimal algorithms for LZ: you can get very good compression ratios by combining dynamic programming for the parsing step (called "optimal parsing") with fancy data structures like suffix arrays for finding all the matches in O(n) time. The only catch is that each pass over the input data runs at about 10 MB/s, and if your compression format uses entropy coding, you'll need to run multiple passes over the input data to reach a fixed point. Most compression tools only use these algorithms at the highest compression levels, if at all.

LZ Match Depth

A very common heuristic for match finding is to limit the search depth. For instance, the miniz_oxide crate only checks the first 1, 6, and 36 possible matches at the lowest three compression levels, respectively. But this isn't the only heuristic that can be applied to improve the match-finding process.
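To make the depth limit concrete, here is a rough sketch of a depth-limited search over a hash-chain structure. The head/prev layout, the zero sentinel, and the MAX_DEPTH value are assumptions for illustration, not the actual internals of miniz_oxide or fdeflate:

```rust
const WINDOW_MASK: usize = (1 << 15) - 1;
const MAX_DEPTH: usize = 6; // e.g. a low compression level

/// Length of the common prefix of the suffixes starting at `a` and `b`.
fn matching_len(data: &[u8], a: usize, b: usize) -> usize {
    data[a..].iter().zip(&data[b..]).take_while(|(x, y)| x == y).count()
}

/// `head[hash]` is the most recent position whose prefix hashed to `hash`;
/// `prev[pos & WINDOW_MASK]` links back to the previous position with the
/// same hash. Returns the best `(length, distance)` found, if any.
fn best_match(
    data: &[u8],
    pos: usize,
    hash: usize,
    head: &[u32],
    prev: &[u32],
) -> Option<(usize, usize)> {
    let mut candidate = head[hash] as usize;
    let mut best: Option<(usize, usize)> = None;
    for _ in 0..MAX_DEPTH {
        if candidate == 0 || candidate >= pos {
            break; // zero doubles as the "empty chain" sentinel here
        }
        let len = matching_len(data, candidate, pos);
        if best.map_or(true, |(best_len, _)| len > best_len) {
            best = Some((len, pos - candidate));
        }
        candidate = prev[candidate & WINDOW_MASK] as usize;
    }
    // Anything further down the chain is never examined, which is
    // exactly where the lost compression ratio comes from.
    best.filter(|&(len, _)| len >= 3)
}
```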

Match Size

Any given LZ compression algorithm will have some minimum match size that it can represent. In LZ4, the minimum match size is, perhaps unsurprisingly, 4 bytes. And in zlib the minimum is 3 bytes. However, just because a match is representable doesn't mean that it is worth using. Thanks to entropy coding, it may be cheaper to represent three bytes of input data as literals rather than as a single short match. This is particularly relevant to PNGs, which have a very skewed distribution of input data values after filtering.
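As a back-of-the-envelope sketch of why that can happen (all the numbers here are assumptions for illustration, not measurements; the real costs depend on the Huffman tables built for each block):

```rust
fn main() {
    // Suppose filtering has made the data mostly zeros, so the dynamic
    // Huffman code assigns the zero literal a very short code.
    let common_literal_bits = 2; // assumed cost of a very common literal
    let three_literals = 3 * common_literal_bits; // ~6 bits total
    // A minimal match still needs a length code, a distance code, and
    // the distance's extra bits.
    let three_byte_match = 7 + 5 + 6; // ~18 bits total
    assert!(three_literals < three_byte_match); // literals are cheaper here
}
```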

One of the clever changes that zlib-ng made compared to the original zlib library is that it only searches for matches of 4+ bytes, regardless of the compression level. The zstd library goes further: its minimum match length depends on both the compression level and the expected input size. This makes a lot of sense for zstd because the format has a very large window: finding a 6 or 7 byte match from many megabytes back can be a real win, while a 3 or 4 byte match at that distance might be worse than just emitting literals.

When I first saw this, I thought zstd's approach was unsuitable for anything but the fastest compression levels when doing deflate compression. At level 1 or 2, maybe it's OK to miss tons of matches because you'll spend far less time looking for them. But for anything higher, intentionally ignoring every 4 or 5 byte match seemed like going too far.

But I realized that this is the wrong way to look at it. The maximum-search-depth heuristic already means that anything but the highest compression levels will miss potentially hundreds of matches, and there's no upper bound on how much compression ratio that gives up. Only inserting longer matches into the lookup structure means we'll miss some matches too, but that's made up for by the fact that the matches we do find will be the "good" ones that save more bytes.

Implementation

The actual implementation of this idea is quite simple. When inserting into the hash table, fdeflate loads 8 bytes from the input, masks out everything past the minimum match length, hashes the result, and stores the index.
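A sketch of that insertion step, assuming little-endian loads, a 15-bit table, and a Fibonacci-style multiplicative hash; the constant, table layout, and function name are illustrative, not fdeflate's actual internals:

```rust
/// Insert `pos` into a hypothetical single-probe hash table, hashing only
/// the first `min_match` bytes (assumed to be in 4..=8) at that position.
fn insert(data: &[u8], pos: usize, min_match: u32, table: &mut [u32]) {
    let bytes = u64::from_le_bytes(data[pos..pos + 8].try_into().unwrap());
    // On little-endian, the low-order bytes of `bytes` are the first bytes
    // of the input, so this mask keeps exactly the first `min_match` bytes.
    let masked = bytes & (u64::MAX >> (64 - 8 * min_match));
    let hash = masked.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> (64 - 15);
    table[hash as usize] = pos as u32; // table has 1 << 15 slots
}
```

Because short matches never make it into the table, lookups simply cannot find them: the minimum is enforced by construction rather than by filtering candidates afterwards.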

When looking up matches, the code is unaffected: it already XORs the 8 bytes at the current location with the 8 bytes at the match candidate and uses .trailing_zeros() / 8 to count the number of matching bytes. So no changes were required other than making the minimum length check variable rather than hardcoded to 4 bytes.
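That length computation looks roughly like this (little-endian assumed; the function name is mine):

```rust
/// Count how many of the first 8 bytes of two little-endian loads match.
/// The XOR is zero exactly in the bytes that agree, and `trailing_zeros`
/// counts from the low-order end, i.e. from the first byte in memory.
fn common_prefix_len(current: u64, candidate: u64) -> u32 {
    // `trailing_zeros` of 0 is 64, so a full 8-byte match returns 8;
    // the real code would continue comparing past that point.
    (current ^ candidate).trailing_zeros() / 8
}
```

With that in place, supporting a variable minimum is just a matter of comparing the result against min_match instead of a literal 4.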