Optimizing LZ Match Size
August 10, 2025

This is a short post about a recent optimization I made in the process of adding new fast compression levels to the fdeflate crate. As a bit of background, fdeflate provides fast zlib compression and decompression as a component of the png crate.
Background
LZ compression works by finding matches in the input data and replacing them with references to previous occurrences. This article provides a good overview of how modern LZ compression works.
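To make that concrete, here is a toy model of an LZ token stream (illustrative types, not fdeflate's): the output is a mix of raw literals and back-references, and the decoder rebuilds the input by copying from data it has already produced.

```rust
// A conceptual LZ token: either one raw byte, or "go back `distance` bytes
// and copy `length` bytes from there".
enum Token {
    Literal(u8),
    Match { length: usize, distance: usize },
}

fn main() {
    // "abcabcabc" encodes as the literals a, b, c followed by a single
    // length-6 match that points 3 bytes back.
    let tokens = [
        Token::Literal(b'a'),
        Token::Literal(b'b'),
        Token::Literal(b'c'),
        Token::Match { length: 6, distance: 3 },
    ];
    let mut out: Vec<u8> = Vec::new();
    for token in &tokens {
        match token {
            Token::Literal(byte) => out.push(*byte),
            Token::Match { length, distance } => {
                // Copy byte-by-byte so the match may overlap the current
                // position; that's how 3 source bytes expand into 6.
                for _ in 0..*length {
                    let byte = out[out.len() - distance];
                    out.push(byte);
                }
            }
        }
    }
    assert_eq!(out, b"abcabcabc");
}
```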
One detail of lossless compression algorithms that is often overlooked is that fast compressors rely on a whole range of ad-hoc heuristics. There's no such thing as "zlib level 3" beyond just the collection of heuristics that the zlib library decided to bundle up and call "level 3".
There actually are nearly optimal algorithms for LZ: you can get very good compression ratios by combining dynamic programming for the parsing step (called "optimal parsing") with fancy data structures like suffix arrays for finding all the matches in O(n) time. The only catch is that each pass over the input data runs at about 10 MB/s, and if your compression format uses entropy coding, you'll need to run multiple passes over the input data to reach a fixed point. Most compression tools only use these algorithms at the highest compression levels, if at all.
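The parsing half of that approach boils down to a shortest-path recurrence over the input. Here is a rough sketch, assuming a fixed bit-cost model and a hypothetical matches_at oracle that returns in-bounds (length, distance) candidates; a real optimal parser would also re-derive the costs from the entropy coder and iterate.

```rust
/// best[i] = the cheapest cost (in bits) of encoding input[..i].
/// Sketch only: assumes `matches_at(i)` never returns a match that runs
/// past the end of the input.
fn optimal_parse_cost(
    input: &[u8],
    literal_cost: impl Fn(u8) -> u32,
    match_cost: impl Fn(usize, usize) -> u32,
    matches_at: impl Fn(usize) -> Vec<(usize, usize)>,
) -> u32 {
    let n = input.len();
    let mut best = vec![u32::MAX; n + 1];
    best[0] = 0;
    for i in 0..n {
        // Extend the cheapest parse of input[..i] with one literal...
        let lit = best[i] + literal_cost(input[i]);
        if lit < best[i + 1] {
            best[i + 1] = lit;
        }
        // ...or with any match starting at position i.
        for (len, dist) in matches_at(i) {
            let cost = best[i] + match_cost(len, dist);
            if cost < best[i + len] {
                best[i + len] = cost;
            }
        }
    }
    best[n]
}
```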
LZ Match Depth
A very common heuristic for match finding is to limit the search depth. For instance, the miniz_oxide crate only checks the first 1, 6, and 36 possible matches at the lowest three compression levels, respectively. But this isn't the only heuristic that can be applied to improve the match-finding process.
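As a sketch of what depth limiting looks like (hypothetical names, not miniz_oxide's actual code), a zlib-style hash-chain walk simply gives up after max_depth candidates:

```rust
/// Walk the hash chain starting at `head`, comparing at most `max_depth`
/// candidates, and return the best (length, distance) found.
fn best_match(
    input: &[u8],
    pos: usize,
    head: Option<usize>,     // most recent earlier position with the same hash
    prev: &[Option<usize>],  // prev[i] = the next-older position with that hash
    max_depth: usize,
) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize)> = None;
    let mut candidate = head;
    for _ in 0..max_depth {
        let Some(cand) = candidate else { break };
        // Count how many bytes match between the candidate and `pos`.
        let len = input[cand..]
            .iter()
            .zip(&input[pos..])
            .take_while(|(a, b)| a == b)
            .count();
        if best.map_or(true, |(best_len, _)| len > best_len) {
            best = Some((len, pos - cand));
        }
        candidate = prev[cand];
    }
    // Only keep matches the format can actually represent (3+ bytes in deflate).
    best.filter(|&(len, _)| len >= 3)
}
```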
Match Size
Any given LZ compression algorithm will have some minimum match size that it can represent. In LZ4, the minimum match size is, perhaps unsurprisingly, 4 bytes. In zlib, the minimum is 3 bytes. However, just because a match is representable doesn't mean that it is worth using. Thanks to entropy coding, it may be cheaper to represent three bytes of input data as literals rather than as a single short match. This is particularly relevant to PNGs, which have a very skewed distribution of input data values after filtering.
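For a rough sense of the numbers: with deflate's fixed Huffman codes, the cheapest length-3 match costs 7 bits for the length code plus at least 5 bits for the distance code, so 12 bits at minimum. A dynamic Huffman table built for heavily filtered PNG data, on the other hand, can assign the most common literal (often the zero byte) a code just one or two bits long, so three such literals can easily come in under 12 bits.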
One of the clever changes that zlib-ng made compared to the original zlib library is that it only searches for matches of 4+ bytes. It uses this strategy regardless of the compression level. The zstd library goes further: the minimum match length actually depends on both the compression level and the expected input data size. This makes a lot of sense for zstd because the format has a very large window size, where finding a 6- or 7-byte match from many megabytes back could be useful, but a 3- or 4-byte match might be worse than using literals.
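As a purely illustrative sketch of the idea (made-up thresholds, not zstd's actual parameter tables): both a lower compression level and a larger window push the minimum match length up, because fast searches and long distances both make short matches less worthwhile.

```rust
/// Hypothetical tuning function; the cutoffs here are invented for
/// illustration and do not match any real library's tables.
fn min_match_length(level: i32, window_size: usize) -> usize {
    // Fast levels skip 3-byte matches entirely.
    let mut min_match = if level <= 3 { 4 } else { 3 };
    // A huge window raises the bar further: encoding a distance of a
    // megabyte or more takes many bits, so a short match from that far
    // back is rarely a win.
    if window_size > 1 << 20 {
        min_match += 1;
    }
    min_match
}
```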
When I initially saw this, I thought zstd's approach was unsuitable for anything but the fastest compression levels when doing deflate compression. At level 1 or 2, maybe it is OK to miss tons of matches because you'll spend far less time looking for them. But for anything higher, the idea of intentionally ignoring all 4- or 5-byte matches seemed like it was going too far.
But I realized that this is the wrong way to look at it. The maximum search depth heuristic already means that anything but the highest compression levels will miss potentially hundreds of matches, and there's no upper bound on how much compression ratio that gives up. Only putting longer matches into the lookup structure also means we'll miss some matches, but that's offset by the matches we do find being the "good" ones that save more bytes.
Implementation
The actual implementation of this idea is quite simple. When inserting into the hash table, fdeflate loads 8 bytes from the input, masks out everything past the minimum match length, then hashes the result and stores the index.
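Here's roughly what that insertion path looks like (hypothetical names, not fdeflate's exact code). Because the hash only sees the first min_match bytes, positions whose prefixes differ within that window will, hash collisions aside, never end up in the same slot.

```rust
/// Insert `pos` into a single-entry-per-slot hash table, keyed on only the
/// first `min_match` bytes (1..=8) at that position.
fn insert(table: &mut [u32], input: &[u8], pos: usize, min_match: usize) {
    // Load 8 bytes at `pos`; the real code must be careful near the end of
    // the buffer, but assume pos + 8 <= input.len() here.
    let bytes = u64::from_le_bytes(input[pos..pos + 8].try_into().unwrap());
    // Mask out everything past the minimum match length (little-endian, so
    // the "first" bytes are the low-order ones).
    let mask = u64::MAX >> (64 - 8 * min_match);
    let hashed = hash(bytes & mask);
    table[hashed as usize % table.len()] = pos as u32;
}

/// Any decent 64-bit mixer works here; multiply-shift is the classic choice.
fn hash(v: u64) -> u64 {
    v.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32
}
```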
When looking up matches, the code is unaffected. It already XORs the 8 bytes at the current location with the 8 bytes at the match candidate and uses .trailing_zeros() / 8 to count the number of matching bytes. So no changes were required other than making the minimum length check variable rather than hardcoded to 4 bytes.
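For completeness, here is a minimal reconstruction of that length check, assuming little-endian 8-byte loads (the real code also extends matches past 8 bytes and handles the end of the buffer):

```rust
/// Number of leading bytes shared by two little-endian 8-byte loads.
fn matching_bytes(a: u64, b: u64) -> u32 {
    let diff = a ^ b;
    if diff == 0 {
        8 // all eight bytes agree; keep comparing further in the real code
    } else {
        // The first differing byte sets a bit somewhere within that byte, so
        // the count of trailing zero bits / 8 is the count of equal bytes.
        diff.trailing_zeros() / 8
    }
}

fn main() {
    let a = u64::from_le_bytes(*b"abcdefgh");
    let b = u64::from_le_bytes(*b"abcdWXYZ");
    assert_eq!(matching_bytes(a, b), 4);
}
```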