Disclaimer: I hate writing. I’m using AI to get my ideas onto paper. The opinions, experience, and numbers are mine. The grammar is not.

The first rusticle post was about skipping work: reuse the GIF’s existing palette when you can, avoid full requantization, and let the fast path do the boring thing quickly.

The follow-up was about quality: Butteraugli exposed correctness and representation bugs that PSNR/SSIM were too willing to average away.

This one is about the part underneath both of those: quantization.

Not the sexy version where every idea works and the ending is a massive benchmark chart. The useful version. I replaced a constrained fallback quantizer path, found a few real wins, and then reverted three plausible optimizations because the broader corpus said they were not worth keeping.

That last part is the point.

The constraint

Rusticle can use imagequant, which is excellent. But I still wanted a practical no-imagequant path: permissive, embeddable, and good enough that the encoder does not fall off a cliff when the main quantizer is unavailable or not desired.

The old fallback path was not where I wanted it. It worked, but it was not something I trusted as a long-term foundation. The goal was not to beat imagequant everywhere. That would be a different project. The goal was narrower:

  • replace the fallback with something I understood
  • keep the stack permissive
  • get closer to acceptable quality
  • avoid paying for quantization when the input was already effectively encoded
  • measure on enough files that I could delete bad ideas without guessing

The last bullet mattered more than I expected.

The shape of the final stack

The retained path ended up being boring in a good way:

source pixels
   |
   |-- <=256 exact colors? -> exact bypass
   |
   |-- source palette useful? -> seed palette
   |
   v
Wu quantizer -> weighted refinement -> AVX2 remap -> dither -> GIF

What stayed:

  • custom Wu fallback quantizer
  • source-palette seeding
  • exact <=256 color bypass
  • AVX2 nearest-color on x86
  • weighted unique-color refinement
  • simple dither policy: ordered dither for quality <= 70, Floyd-Steinberg otherwise

The ARM64 NEON nearest-color path also exists, but I do not consider it fully validated yet. It still needs real Mac/ARM benchmark coverage before I would put weight behind it.

The fastest quantizer is the one you do not run

The cleanest win was the exact-color bypass.

If a frame has <=256 exact colors, there is no reason to build a new palette. The GIF format can already represent it. Running quantization anyway is just spending CPU to maybe make the output worse.

So the fallback path first counts unique colors. If the frame fits, it emits those colors directly and remaps pixels without invoking Wu or refinement.

That sounds obvious. It is also the kind of obvious thing that disappears once an encoder pipeline gets generic enough. Everything becomes “RGBA in, palette out,” and suddenly you are quantizing frames that are already palette-compatible.

Source-palette seeding is the same idea in softer form. GIF input already has palette information. Even after resizing, that palette often contains useful structure. Seeding from it gives the fallback quantizer a better starting point than pretending every frame is a fresh RGB image with no history.

Neither idea is clever. Both are useful.

Wu was the right fallback baseline

I replaced the old fallback with a custom Wu quantizer. Wu quantization is a good fit here because it gives a deterministic palette quickly and does not require the same kind of iterative search as a heavier quantizer.

It is not magic. It is a practical baseline:

  • build a color histogram
  • partition color space into boxes
  • choose representative colors
  • remap pixels to the nearest palette entry

That made the fallback easier to reason about and easier to optimize. Once the algorithm was under my control, the hotspot became obvious: nearest-color search.

SIMD helped after the algorithm stopped wasting effort

I tried Linux perf first. The machine said no: perf_event_paranoid = 4. That is normal on hardened systems and annoying when you are trying to do real profiling.

So I used gprofng as the fallback. It was good enough to point at the obvious target. After AVX2 landed, photo_01 still spent most of its visible CPU in nearest-color work:

SymbolExclusive CPU
rusticle::quantize::kmeans::nearest_color_avx277.36%

Caller split:

CallerShare
refine_palette68.29%
final quantization/remap path31.71%

That tells you two things.

First, AVX2 was aimed at the right place. Nearest-color dominates once the dumb waste is gone.

Second, SIMD does not remove the algorithmic cost. It makes the cost less bad. If you can bypass quantization entirely, do that first. If you cannot, then SIMD is worth caring about.

The retained stack numbers on the small trio looked like this:

FileTimingQuality
cartoon_0265.66ms -> 62.68msPSNR 40.36, SSIM 0.9991
photo_01213.41ms -> 188.96msPSNR 40.35, SSIM 0.9990
voyager94.18ms -> 96.98msPSNR 41.53, SSIM 0.9992

That is not a victory-lap table. Cartoon got a small win. Photo got a real win. Voyager got slightly slower. The Voyager story is not micro-tuning; it is mostly exact-color bypass and indexed-structure handling.

This is why I do not like benchmark posts that only show the rows that went up.

The three optimizations I deleted

This was the real optimization campaign:

idea -> small benchmark -> looks good
                      |
                      v
              broader corpus check
                |             |
                v             v
              keep          revert

Three ideas looked plausible. All three lost.

1. Seeded zero-refine

The idea was simple: if a seeded palette is already strong enough, skip k-means refinement entirely.

The loose gate looked attractive:

  • sample limit 256
  • mean SSE <= 32

On the trio:

FileEncodeQuality delta
cartoon_0297ms -> 76msPSNR 40.60 -> 40.05, BA 2.47 -> 2.58
photo_01211ms -> 133msPSNR 40.32 -> 39.19, BA 2.57 -> 2.80
voyager54ms -> 49msunchanged quality

That is tempting. It is also wrong for a default. The quality drop on real content was not acceptable.

So I tightened the gate:

  • high quality only (>= 71)
  • non-empty seeded palette only
  • near-cap seeds (max_colors - 16)
  • sample limit 1024
  • mean SSE <= 8
  • max SSE <= 64

That preserved quality, but the win mostly disappeared:

FileEncodeQuality delta
cartoon_0297ms -> 93msunchanged
photo_01211ms -> 222msunchanged
voyager54ms -> 54msunchanged

Then the broader 30-file x86 suite made the decision easy:

MetricValue
Total wall time delta+40.96 ms over suite
Current total12,375.31 ms
Baseline total12,334.35 ms
Delta %+0.33%
Avg per-file wall delta+1.37 ms
Avg CLI total delta+3.46 ms
Avg BA delta+0.001
Avg PSNR delta-0.0117 dB
Avg SSIM delta~0

Category breakdown:

CategoryRuntime deltaNotes
photographic-28.61 ms avgfaster
large+31.48 msslower
many_frames+10.16 msslower
pixel_art+8.79 msslower
transparent+3.30 msslower
cartoon+2.30 msslower
simple-2.19 msslightly faster

Biggest regression: large_dims_02 at +85.74 ms. Biggest improvement: photo_04 at -141.13 ms.

The result was effectively neutral, maybe a niche win on photographic content. Not enough to justify default complexity. Reverted.

2. Sampled dither dispatch

The idea: choose no-dither, ordered dither, or Floyd-Steinberg from sampled error instead of a fixed quality split.

This is exactly the kind of optimization that sounds right. Dithering is content-dependent. Why not sample the frame and choose dynamically?

Because the corpus said no.

The broader x86 suite against the retained baseline showed:

MetricValue
Runtime delta+0.10%
Avg BA delta+0.072
Avg PSNR delta-0.426 dB
Avg SSIM delta-0.000097

The average was not the real problem. The category regressions were:

CategoryBA deltaPSNR deltaSSIM delta
pixel_art avg+0.37-2.34 dB-
many_frames avg+0.16-0.72 dB-0.000625

Worst file regressions:

FileBA deltaPSNR delta
pixel_art_02+1.48-9.36 dB
many_frames_01+0.64-2.89 dB

The trio looked okay. The broader suite exposed the cost. Reverted.

The replacement was deliberately dumb: ordered dither for quality <= 70, Floyd-Steinberg otherwise. It is less clever and easier to explain. That matters.

3. Temporal palette reuse

The idea: reuse frame N-1’s final palette/LUT for frame N when frames are close enough.

This one felt promising because GIFs are temporal. Adjacent frames are often similar. Reusing the previous palette should avoid work and maybe improve stability.

The catch was implementation shape. The no-imagequant path had to become more sequential to make reuse decisions. That cost was brutal.

30-file suite result:

MetricValue
Avg wall runtime delta+636.81 ms/file
Avg BA delta+0.3787
Avg PSNR delta-1.0333 dB
Avg SSIM delta-0.0001233
Avg output bytes delta-24,089 bytes

Category breakdown:

CategoryRuntime deltaBA delta
many_frames+871.89 ms+0.5675
cartoon+657.51 ms+0.1960
photographic-325.54 ms (one strong outlier)+0.3040
transparent+803.43 ms+0.2250
pixel_art+1,594.40 ms+0.56

Top runtime regressions:

FileRuntime delta
pixel_art_03+3,520.39 ms
pixel_art_04+2,032.86 ms
small_simple_02+2,026.26 ms
many_frames_04+2,013.30 ms
photo_05+1,204.92 ms

It did reduce output bytes by about 24 KB on average. That was not enough. Runtime got much worse and quality got worse. Reverted immediately.

What survived

The final retained stack is not the version with the most ideas in it. It is the version where each remaining piece has a reason to exist.

Stage timing still says encode matters:

FileDecodeProcessEncode
cartoon_0228ms7ms26ms
photo_01111ms16ms55ms
pixel_art_0225ms11ms71ms

So I kept the pieces that reduced real encode cost or made the fallback more robust:

  • exact-color bypass, because not quantizing is the best quantizer
  • source-palette seeding, because GIF inputs already contain useful palette information
  • custom Wu fallback, because the code is understandable and permissive
  • weighted unique-color refinement, because common colors should matter more than rare ones
  • AVX2 nearest-color, because profiling pointed directly at that hotspot

I did not keep:

  • zero-refine, because the safe version was basically neutral
  • sampled dither dispatch, because it hurt pixel-art and many-frame GIFs
  • temporal palette reuse, because it made the path sequential and slow

That is a better result than shipping all of it.

Small local wins are not product wins

I also tried RUSTFLAGS="-C target-cpu=native". It gave modest local gains, roughly 4-6% on cartoon/photo, with no meaningful Voyager win. I did not keep it as repo policy.

That is another example of the same rule: a local speedup is not automatically a project decision. It has deployment costs, portability costs, and benchmark-story costs. If the win is modest and machine-specific, it belongs in a note, not in the default build.

There were earlier dead ends too: a palette-space resize prototype that was too slow and lower quality, and a final remap LUT experiment that was slower and changed output. Both sounded reasonable. Both lost.

Where this leaves the quantizer

The satisfying part is not that every optimization worked. Most did not.

The satisfying part is that the process let me keep the few that clearly did and delete the ones that did not. The repo ended up simpler than the “keep every clever idea” version would have been.

That is the standard I want for this code now:

  • small benchmark first
  • broad corpus before default
  • quality and runtime together
  • revert without drama when the numbers say no

The remaining caveat is ARM. The NEON path needs real validation on Mac/ARM hardware before I talk about it the same way I talk about the x86 path. Until then, it is promising code, not a proven result.

Most optimization work does not end with a trick. It ends with a smaller diff than you expected and a better reason for every line that stayed.