All in a Day's Work: SAEs, MAX Engine, and a Memory That Thinks
Three things landed this week at Light of Baldr — an open SAE dataset for Gemma-4-31B, a working MAX Engine stack on DGX Spark, and early signal from an activation-level memory system.
Three things landed this week. They look unrelated on the surface — an interpretability dataset, an infrastructure fix, a memory experiment — but they’re all the same project from different angles: figuring out what a 31B-parameter model is actually doing inside, and using that understanding to make it do more.
The full technical writeups live on the Light of Baldr site. This post is the short version of the day’s work.
1. 60 Layers of Gemma-4-31B, Open on HuggingFace
We trained sparse autoencoders on every layer of Gemma-4-31B and ran each one through a 6-phase verification pipeline (SIPIT scoring, dual auto-interpretation, cross-validation, linear probes, confidence weighting). The result: 3,000 interpreted features, 104GB of trained weights, Apache 2.0.
The thing I keep staring at: the SIPIT profile shows three clean processing phases — gradual descent from token space, a deep integration zone around layer 27, then token rendering at the end. The same ~40-45% integration depth shows up in Qwen-3-4B and Gemma-3-1B too. Feels universal for attention-based transformers.
Dataset: Adam1010/gemma-4-31b-sae-features. Pipeline writeup on the LLC site.
2. MAX Engine on DGX Spark — Mostly Working
Modular’s MAX Engine claims day-zero Gemma-4 support and DGX Spark compatibility. In practice, getting Gemma-4-31B to actually serve required four fixes:
- A unified-memory patch so the memory estimator stops reading 34GB from CUDA and starts reading 128GB from
/proc/meminfo - A Pydantic
model_rebuild()dance to resolve deferred torch tensor types - A
LD_PRELOADshim that capscuMemAllocso CUDA doesn’t eat the OS - A
temperature: 0workaround until Modular fixes a top-K kernel that asks for more registers than the GB10’s SM 12.1 supports
With all of that in place, MAX compiles the model in ~290s, loads weights, starts the OpenAI-compatible server, and serves greedy inference. Bug filed at modular/modular#6488. Patches and the wrapper script are on the LLC site.
3. A Memory That Operates on Activations, Not Tokens
The third thing is the one I’m least ready to talk about, which is also the most interesting. We’ve been testing whether a model can recall activation patterns — not retrieved text — at one of the deep integration layers, and have its current processing nudged by the recall.
Early signal on Gemma-4-31B running locally: same prompt, with and without memory active, the model’s framing shifts toward the domain of the stored memories. Coherent. Accurate. Proportional to relevance.
That’s all I’ll say here. The mechanism, the indexing, the multi-layer architecture, the safety protocol against memory poisoning — those are on the LLC site as we validate each piece. If you’re working on activation-level memory, experience-driven adaptation, or interpretability-informed retrieval, I’d love to compare notes.
Why These Three Together
The interpretability work tells you where in the model thinking happens. The infrastructure work makes it possible to run experiments on a 31B model under your desk. The memory work uses both — you target the integration layers because the SAE work showed you which ones matter, and you can iterate fast because the inference stack is yours.
That’s the loop. Build the instruments, fix the rig, run the experiments, repeat.
Built by Light of Baldr LLC. Full technical writeups live there.