Shrinking a WebGPU background to 22 KB on the wire
I built a WebGPU animation: a faceted liquid-mirror grid simulated and rendered on the GPU with wgpu, compiled to WebAssembly. It ripples on its own from random impulses in the simulation. It loads on every page and it is not the content, so it cannot be expensive. A background animation that costs 100 KB has no business shipping.
It started at 101,806 bytes. I got it to 56,459 bytes, which Cloudflare serves as about 22 KB of Brotli over the wire. Here is everything that moved the number, including the things that didn't, because the dead ends are half the lesson.
The starting point
The crate was already built the way every guide tells you to build a release WASM:
[profile.release]
opt-level = "z" # optimise for size, not speed
lto = true # fat LTO: inline and prune across crates
codegen-units = 1 # no parallelism, better cross-function optimisation
panic = "abort" # no unwinding tables
strip = true # drop symbols
Then wasm-bindgen to generate the JS glue, then wasm-opt -Oz from Binaryen for another pass. That is the standard recipe, and it got me to 99.4 KB. Every further gain came from going past the standard recipe.
Rebuild the standard library: 28 KB gone
The biggest single lever is one most projects never pull, because it needs nightly: recompile std, core, and alloc from source, optimised for size, so your profile settings reach the standard library instead of stopping at your own crate. The prebuilt std that ships with the toolchain is compiled for speed and includes machinery you will never call.
On current nightly the recipe has three parts. A .cargo/config.toml:
[unstable]
build-std = ["std", "panic_abort"]
build-std-features = ["optimize_for_size"]
A pinned toolchain, because you are now compiling the standard library and a drifting nightly can change the output or fail outright:
# rust-toolchain.toml
[toolchain]
channel = "nightly-2026-06-05"
components = ["rust-src"]
targets = ["wasm32-unknown-unknown"]
And the panic strategy that actually deletes the panic-formatting code. This used to be a build-std feature called panic_immediate_abort; it is now a real strategy you set in the profile, gated by a cargo feature:
cargo-features = ["panic-immediate-abort", "trim-paths"]
[profile.release]
panic = "immediate-abort"
trim-paths = "all"
immediate-abort turns every panic into a bare trap, with no message and no location string. This step alone took the module from 99.4 KB to 71.6 KB.
Measure, don't guess: the float formatter
At this point I stopped guessing and looked. The tool here is twiggy, which reads the WASM and tells you which functions own which bytes.
The profile turned up something I did not expect near the top of the list: core::fmt::float::float_to_decimal_common_shortest and a friend called bignum::Big32x40. That is the full f64-to-string algorithm: Grisu, with a bignum fallback for the hard cases. Several kilobytes of it. Why was a background shader linking the float printer?
Two lines:
let _ = style.set_property("width", &format!("{}px", backing_w as f64 / dpr));
let _ = style.set_property("height", &format!("{}px", backing_h as f64 / dpr));
Setting the canvas CSS size. format!("{}px", some_f64) drags in the entire f64 Display implementation. I needed sub-pixel precision there (the backing store has to map cleanly to device pixels or the facet seams drift), so I couldn't just round to an integer. But I could format it as fixed-point with integer math and never touch the float printer:
// Hypothetical push_uint (the original helper is not shown): decimal digits via integer math only.
fn push_uint(s: &mut String, mut n: u64) {
    let mut buf = [0u8; 20]; // u64::MAX has 20 decimal digits
    let mut i = buf.len();
    loop {
        i -= 1;
        buf[i] = b'0' + (n % 10) as u8;
        n /= 10;
        if n == 0 { break; }
    }
    s.extend(buf[i..].iter().map(|&b| b as char));
}
fn css_px(backing: u32, dpr: f64) -> String {
    let scaled = (backing as f64 / dpr * 1e4).round() as u64; // CSS px in units of 1/10,000
    let frac = (scaled % 10_000) as u32;
    let mut s = String::with_capacity(16);
    push_uint(&mut s, scaled / 10_000); // integer part, hand-rolled
    s.push('.');
    for div in [1000, 100, 10, 1] { // four decimals, zero-padded
        s.push((b'0' + (frac / div % 10) as u8) as char);
    }
    s.push_str("px");
    s
}
That removed about 8 KB, more than the float code itself, because once nothing reached the float path a chunk of supporting fmt plumbing went with it. Down to 63.5 KB. The lesson generalises: in a size-constrained WASM, string formatting is the most expensive thing you will reach for without noticing. The rustwasm book says it outright: keep formatting in debug builds, ship static strings in release. I applied the same idea to the one other formatting site, the error path, which only ever feeds a developer console:
fn err<E: std::fmt::Display>(e: E) -> JsValue {
    #[cfg(debug_assertions)]
    return JsValue::from_str(&e.to_string());
    #[cfg(not(debug_assertions))]
    { let _ = e; JsValue::from_str("wgpu init failed") }
}
The allocator: 5 KB gone
Rust's default WASM allocator is dlmalloc, and it costs somewhere around 4 to 5 KB. You only need a good allocator if you allocate on a hot path, and I don't: the GPU resources are a handful of long-lived boxes set up once. So I swapped in lol_alloc, a minimal allocator built for exactly this:
#[global_allocator]
static ALLOC: lol_alloc::AssumeSingleThreaded<lol_alloc::FreeListAllocator> =
    unsafe { lol_alloc::AssumeSingleThreaded::new(lol_alloc::FreeListAllocator::new()) };
AssumeSingleThreaded is sound here because wasm32-unknown-unknown has no threads, so there is only the one event loop touching the heap. That bought another 4.9 KB.
The next rung down would be to drop the allocator entirely and use static arrays, which other people have used to reach single-digit kilobytes. I can't: wgpu allocates internally (it builds Vecs and label strings), so an allocator has to exist. lol_alloc is the floor for anything that talks to wgpu.
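To make "minimal allocator" concrete, here is a sketch of about the simplest design there is: a bump allocator over a fixed buffer that never frees. This is not lol_alloc's implementation and not what this crate ships (that is lol_alloc's FreeListAllocator, as shown above). The BumpAlloc name, the 4 KiB HEAP_SIZE, and the never-free policy are all assumptions made for illustration.

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::cell::{Cell, UnsafeCell};
use std::ptr;

/// Illustrative heap size; a real module would size this to its needs.
pub const HEAP_SIZE: usize = 4096;

/// Backing storage, aligned so small allocations start on a 16-byte boundary.
#[repr(align(16))]
struct Heap([u8; HEAP_SIZE]);

/// Hypothetical bump allocator: hands out consecutive slices of `heap`
/// and never reuses memory.
pub struct BumpAlloc {
    heap: UnsafeCell<Heap>,
    used: Cell<usize>, // offset of the first free byte
}

// Only sound under the same assumption the article makes for
// AssumeSingleThreaded: a single thread ever touches the allocator.
unsafe impl Sync for BumpAlloc {}

impl BumpAlloc {
    pub const fn new() -> Self {
        BumpAlloc {
            heap: UnsafeCell::new(Heap([0; HEAP_SIZE])),
            used: Cell::new(0),
        }
    }
}

unsafe impl GlobalAlloc for BumpAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let base = self.heap.get() as *mut u8;
        // Round the next free address up to the requested alignment.
        let addr = base as usize + self.used.get();
        let aligned = (addr + layout.align() - 1) & !(layout.align() - 1);
        let start = aligned - base as usize;
        match start.checked_add(layout.size()) {
            Some(end) if end <= HEAP_SIZE => {
                self.used.set(end);
                unsafe { base.add(start) }
            }
            // Out of space: GlobalAlloc signals failure with a null pointer.
            _ => ptr::null_mut(),
        }
    }

    // Freeing is a no-op; memory is only reclaimed when the module goes away.
    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {}
}
```

A type like this could be installed with #[global_allocator] in the same way as the lol_alloc static. The trade-off is that nothing is ever reclaimed, which only suits programs that allocate a bounded amount up front. A free-list allocator costs a little more code in exchange for reuse.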
Building it up from nothing
To understand where the rest of the bytes actually come from, I built a second crate from scratch with the same release pipeline and added one feature at a time, measuring the shipped WASM at each step. This is the most useful half-hour you can spend on size.
| layer | bytes | added |
|---|---|---|
| empty #[wasm_bindgen] export | 123 | base |
| reach the DOM via web-sys | 8,437 | +8,314 |
| format! with an integer | 11,595 | +3,158 |
| format! with a float | 21,451 | +9,856 |
| Rc<RefCell<_>> + a closure | 22,124 | +673 |
| async + a JS promise | 26,878 | +4,754 |
| wgpu instance / adapter / device | 35,281 | +8,403 |
Two results overturned my assumptions. First, float formatting is the single most expensive feature in the table: more than reaching the entire DOM, more than bringing up the GPU. Second, Rc and RefCell are nearly free at 673 bytes. The folk wisdom that "the standard library bundles everything, so avoid Rc" is just wrong for the sharing primitives. It is dead right for formatting. Spend your effort where the bytes are.
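The "added" column in the table is just the difference between consecutive cumulative sizes. A small sketch that recomputes it from the measured byte counts and picks out the most expensive layer follows; the LAYERS, added_bytes, and most_expensive names are mine, and only the numbers come from the table.

```rust
/// Cumulative shipped size in bytes after each layer, copied from the table.
const LAYERS: [(&str, u32); 7] = [
    ("empty #[wasm_bindgen] export", 123),
    ("reach the DOM via web-sys", 8_437),
    ("format! with an integer", 11_595),
    ("format! with a float", 21_451),
    ("Rc<RefCell<_>> + a closure", 22_124),
    ("async + a JS promise", 26_878),
    ("wgpu instance / adapter / device", 35_281),
];

/// Bytes added by each layer: the difference between consecutive sizes.
/// The first (base) layer has no predecessor, so it is not included.
fn added_bytes(layers: &[(&'static str, u32)]) -> Vec<(&'static str, u32)> {
    layers
        .windows(2)
        .map(|w| (w[1].0, w[1].1 - w[0].1))
        .collect()
}

/// The layer with the largest delta, if there is more than one layer.
fn most_expensive(layers: &[(&'static str, u32)]) -> Option<(&'static str, u32)> {
    added_bytes(layers).into_iter().max_by_key(|&(_, delta)| delta)
}
```

Keeping the measurements as data like this makes it cheap to rerun the comparison after every toolchain bump, which matters because the deltas depend on the pinned nightly.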
The dead ends I measured
Not every lever moves. These all cost me a build and gave nothing, and I'd rather you not repeat them:
- opt-level = "s" instead of "z". The books tell you to try both. I did. "s" came out 3 KB larger here. Always measure; never assume which way it goes.
- -Zfmt-debug=none and -Zlocation-detail=none. These strip Debug impls and panic locations. They moved the size by one byte, because panic = "immediate-abort" had already deleted every panic location, and LTO had already dropped the unused Debug code. Levers that overlap with ones you've pulled give nothing.
- Compile-time target-feature=+bulk-memory,.... In theory, letting the compiler emit memory.copy instead of byte loops shrinks things. In practice wasm-opt already applies those transforms after the fact, so doing it earlier added 135 bytes.
- -Zvirtual-function-elimination. Promising on paper: strip unused vtable entries. It produced broken bitcode on this LLVM/WASM combination. Some nightly flags are nightly for a reason.
Minifying the shaders, and why it barely mattered
The biggest single thing in the finished module is not code. It's .rodata, and most of that is the WGSL shader source, embedded as a string. On the WebGPU backend, wgpu hands that text straight to the browser to compile, so it rides along verbatim. The obvious move was to minify it, at build time in a build.rs, so the source in the repo stays readable and the compact version is what gets embedded:
// build.rs: strip /* */ and // comments + per-line indentation, emit to OUT_DIR.
// WGSL has no string literals, so this can never corrupt a token.
fn minify_wgsl(src: &str) -> String { /* ... */ }
That cut the shader text roughly in half: about 5.8 KB of comments and whitespace gone. The raw module dropped by a bit over 1 KB. The number that ships dropped by 84 bytes.
That gap is the whole lesson. Comments and indentation are the most compressible bytes in the file: repetitive, predictable, exactly what gzip and Brotli are built to erase. Stripping them before compression just does the compressor's job for it. I kept the build.rs anyway: the uncompressed module is what the browser parses and holds in memory, so a leaner one is honest even when the wire size barely moves.
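For reference, a minifier of the kind described above might look like the following. The body is my own sketch, not the author's build.rs, which is elided in the post. It assumes the three transforms named there (drop // comments, drop /* */ comments, strip per-line indentation) and additionally drops blank lines. It tracks nesting depth because WGSL block comments may nest.

```rust
/// Hypothetical WGSL minifier: removes `//` line comments, nestable `/* */`
/// block comments, per-line leading/trailing whitespace, and blank lines.
fn minify_wgsl(src: &str) -> String {
    // Pass 1: drop comments, keep everything else (including newlines).
    let mut code = String::with_capacity(src.len());
    let mut chars = src.chars().peekable();
    let mut depth = 0usize; // current block-comment nesting depth

    while let Some(c) = chars.next() {
        let next = chars.peek().copied();
        if c == '/' && next == Some('*') {
            // Opening a block comment, possibly inside another one.
            chars.next();
            depth += 1;
        } else if depth > 0 {
            if c == '*' && next == Some('/') {
                chars.next();
                depth -= 1;
                if depth == 0 {
                    // Keep the tokens on either side of the comment apart.
                    code.push(' ');
                }
            }
            // Anything else inside a block comment is discarded.
        } else if c == '/' && next == Some('/') {
            // Line comment: skip up to, but not including, the newline.
            while let Some(&n) = chars.peek() {
                if n == '\n' {
                    break;
                }
                chars.next();
            }
        } else {
            code.push(c);
        }
    }

    // Pass 2: trim every line and drop the ones that are now empty.
    let mut out = String::with_capacity(code.len());
    for line in code.lines().map(str::trim).filter(|l| !l.is_empty()) {
        out.push_str(line);
        out.push('\n');
    }
    out
}
```

In a build.rs, the result would be written to a file under OUT_DIR and embedded with include_str!; the file name is up to you. Newlines are kept on purpose: they cost one byte each and keep browser shader error messages pointing at a usable line.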
Compression is the number that matters
All of the above is the uncompressed size. Nobody downloads that. Cloudflare serves the WASM with Brotli, and WebAssembly compresses very well: it is dense, repetitive bytecode. The 56,459-byte module gzips to 25,628 bytes, and Brotli at the edge takes it to roughly 22 KB. That is the figure that hits a visitor's connection and counts against the load budget.
| | raw | gzip |
|---|---|---|
| start | 101,806 | base |
| finished | 56,459 | 25,628 |
So measured on the wire, which is the thing that actually loads, the background animation is about 22 KB: roughly a fifth of the 101,806 raw bytes it began at.
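The ratios behind those figures are easy to check. In this sketch the byte counts come from the post, except FINAL_BROTLI_APPROX, which assumes "roughly 22 KB" means 22,000 bytes; the post gives no exact Brotli figure. The constant and function names are illustrative.

```rust
const START_RAW: u32 = 101_806;
const FINAL_RAW: u32 = 56_459;
const FINAL_GZIP: u32 = 25_628;
/// Assumption: "roughly 22 KB" taken as 22,000 bytes for illustration only.
const FINAL_BROTLI_APPROX: u32 = 22_000;

/// `part` as a percentage of `whole`.
fn percent_of(part: u32, whole: u32) -> f64 {
    f64::from(part) / f64::from(whole) * 100.0
}

/// (raw shrink, gzip ratio, wire size vs. starting raw size), all in percent.
fn summary() -> (f64, f64, f64) {
    (
        percent_of(FINAL_RAW, START_RAW),           // about 55%
        percent_of(FINAL_GZIP, FINAL_RAW),          // about 45%
        percent_of(FINAL_BROTLI_APPROX, START_RAW), // about 22%
    )
}
```

Note that the last ratio compares a compressed size against an uncompressed one. That is the comparison the "about a fifth" claim makes.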
What's left
What remains is the program. The largest pieces are the (now minified) shader source, the async glue around device acquisition, and my own per-frame render code, which is the entire point of the exercise. There is no dead crate to delete, no accidental formatter, no oversized allocator.
That's the line I was aiming for. A binary where the largest remaining things are the shaders and the work itself, and the only way to make it smaller now would be to make it do less.
References
- rustwasm book, Reducing .wasm size: https://rustwasm.github.io/docs/book/reference/code-size.html
- Leptos book, Optimizing WASM binary size: https://book.leptos.dev/deployment/binary_size.html
- Nick Babcock, Avoiding allocations in Rust to shrink WASM: https://nickb.dev/blog/avoiding-allocations-in-rust-to-shrink-wasm-modules/