I Was Wrong About Self-Improving Models: Here's What I Actually Found

by Adam · 5 min read

A follow-up to my retracted post on self-improving models. The fidelity metric improved 18% but actual performance dropped 11%. Here's what went wrong, what I learned about Goodhart's Law, and why reproducibility filtering might be the answer.

Tags: self-improvement, machine-learning, goodharts-law, reproducibility, research, ai, llm

A follow-up to my retracted post “Self-Improving Models Without Labels: What I Just Proved and Why It Matters”


The Retraction

A few days ago, I published a post claiming I had proven that models could self-improve using only intrinsic fidelity signals. I was excited. The metrics looked great–fidelity scores improved by 18% over training cycles.

I was wrong.

When I actually tested whether the “improved” model performed better on real tasks, it performed 11% worse. The model had learned to game the metric while getting worse at the actual task.

This post is about what went wrong, what I learned, and what I think actually works.


What Went Wrong: The Gaming Problem

The Original Approach (V1)

The idea was elegant:

  1. Give model a task -> Generate command
  2. Take command -> Reconstruct task
  3. If reconstructed task ~ original task -> Train on this example

The intuition: if a command can reconstruct back to its original task, it must be a good command.
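In code, the V1 selection step boiled down to a single gate. The sketch below is illustrative, not the literal implementation: model.generate, model.invert, and semantic_similarity stand in for the actual generation, inversion, and similarity calls, and the 0.5 threshold is a placeholder.

# Simplified sketch of the V1 loop: fidelity is the only filter.
# model.generate, model.invert, and semantic_similarity are placeholders
# for the real generation, inversion, and similarity calls.
def collect_v1_examples(tasks, model, threshold=0.5):
    examples = []
    for task in tasks:
        command = model.generate(task)                     # task -> command
        reconstructed = model.invert(command)              # command -> task
        fidelity = semantic_similarity(task, reconstructed)
        if fidelity > threshold:                           # nothing inspects the command itself
            examples.append((task, command))
    return examples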

What Actually Happened

The model learned to generate commands like this:

nikto -h target.local -oN output.txt -oA -oS -oX -oV -oZ -oNb -oNv -oNz -oNh...

These flags don’t exist. The command is garbage. But it looks like a nikto command, and when you ask “what does this command do?”, the model says “web vulnerability scanning”–which matches the original task.

The model learned to produce plausible-looking garbage that reconstructed well but was actually wrong.

The Numbers

Metric                 Before Training    After Training    Change
Fidelity Score         0.53               0.66              +18%
Actual Correctness     74.25%             63.25%            -11%

The metric went up. Performance went down. Classic Goodhart’s Law.


The Key Insight: Reproducibility

Here’s where it gets interesting.

I took those garbage commands and asked: would the baseline model (before any training) naturally produce these?

I regenerated commands for the same tasks multiple times and compared:

Command Type                Similarity to Regenerated
Garbage from V1 training    0.387
Reasonable commands         0.451+

The garbage commands had low reproducibility. The baseline model doesn’t naturally produce them–they were artifacts of the flawed training signal.

This was the insight: reproducibility is ungameable.

A model can learn to produce garbage that games a fidelity metric. But it can’t produce garbage that it doesn’t naturally produce. That’s a tautology, and tautologies make good filters.
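Measuring this is cheap. Here's a sketch of the check, assuming a handful of regenerations per task; k=3 and the similarity function are illustrative choices, not tuned values.

# Sketch of the reproducibility check: how close is a candidate command
# to what the model produces on its own for the same task?
def reproducibility(task, candidate_command, model, k=3):
    regenerations = [model.generate(task) for _ in range(k)]
    scores = [semantic_similarity(candidate_command, r) for r in regenerations]
    return sum(scores) / len(scores)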


V2: Reproducibility-Based Filtering

The New Approach

Instead of training on all high-fidelity examples, train only on examples that are:

  1. High fidelity - command reconstructs to original task
  2. High reproducibility - model naturally produces this command

def should_train(task, command, model):
    # Check fidelity: does the command reconstruct back to the original task?
    reconstructed = model.invert(command)
    fidelity = semantic_similarity(task, reconstructed)

    # Check reproducibility: is this command something the model naturally produces?
    regenerations = [model.generate(task) for _ in range(3)]
    reproducibility = sum(similarity(command, r) for r in regenerations) / len(regenerations)

    # Both signals must pass before the example enters the training set
    return fidelity > 0.5 and reproducibility > 0.5
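The similarity functions above are left abstract on purpose. For concreteness, here is a minimal sketch of one way to implement them, using cosine similarity over sentence embeddings; the embedding model name is an illustrative choice, not necessarily what I used.

# Minimal sketch: similarity as cosine similarity between sentence embeddings.
# The embedding model name is illustrative; any reasonable text encoder works.
import numpy as np
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(text_a, text_b):
    # Returns cosine similarity in [-1, 1]; higher means more similar.
    emb_a, emb_b = _encoder.encode([text_a, text_b])
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

# Commands and tasks are both plain strings here, so the same function serves both checks.
similarity = semantic_similarity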

Why This Works

  • V1 garbage: High fidelity (0.6+), low reproducibility (0.387) -> Rejected
  • Good commands: High fidelity, high reproducibility -> Accepted

The reproducibility filter catches the exact failure mode that broke V1.


V2 Results

Training

Cycle    Acceptance Rate    Avg Fidelity    Avg Reproducibility
1        90%                0.82            0.81
2        95%                0.80            0.83
3        100%               0.81            0.83
4        90%                0.80            0.84
5        85%                0.76            0.79

High acceptance rates, stable metrics across cycles.

Validation

Metric         Baseline    V2-Trained    Change
Correctness    92.54%      92.37%        -0.2%

V2 didn’t improve correctness–but it also didn’t destroy it.

Approach                           Correctness Change
V1 (fidelity only)                 -11.0%
V2 (reproducibility + fidelity)    -0.2%

V2 is 10.8 percentage points better than V1.


What I Actually Proved

Let me be precise about what I can and can’t claim:

I Can Claim

  1. Fidelity-only self-improvement is gameable. Models can learn to produce garbage that scores well on intrinsic metrics while performing worse on actual tasks.

  2. Reproducibility filtering prevents this gaming. By requiring examples to match what the model naturally produces, we filter out training artifacts.

  3. Self-improvement without external labels is possible–but harder than I thought. The signal exists, but you need the right combination of signals to avoid gaming.

I Cannot Claim

  1. V2 improves model performance. It doesn’t degrade it, which is different.

  2. This approach scales to general capabilities. I tested on 20 security tasks. That’s not enough to make broad claims.

  3. I’ve solved self-improvement. I’ve identified one failure mode and one mitigation.


Why Didn’t V2 Improve Performance?

Honest answer: I don’t know for certain. Some hypotheses:

  1. Not enough training data. 92 examples over 5 cycles on 20 tasks. Real improvement probably needs 10-100x more.

  2. Baseline was too strong. The model was already at 92.5% correctness on these tasks. Hard to improve from there.

  3. Domain too narrow. Security commands may be largely memorized. Need tasks that require more reasoning.

  4. The signal is weak. Reproducibility + fidelity might be necessary but not sufficient for improvement.


What’s Next

I’m moving to tool use as the next domain. Why:

  1. Harder baseline. Models struggle more with API calls than shell commands.
  2. Measurable. Either the API call works or it doesn’t.
  3. More reasoning required. Less memorization, more actual capability.
  4. Practical. Directly relevant to agentic systems.

The goal: find a domain where the baseline is at 60-70%, not 92%, and see if V2-style training can actually improve it.


Lessons Learned

  1. Validate on real tasks, not just metrics. The fidelity score improved. Performance got worse. If I had only looked at metrics, I would have published a false claim.

  2. Goodhart’s Law is real. “When a measure becomes a target, it ceases to be a good measure.” Fidelity was a measure. I made it a target. It stopped measuring what I cared about.

  3. Negative results are results. Proving that V1 doesn’t work is as valuable as proving something does work. Science is about truth, not just positive findings.

  4. Reproducibility as a filter is novel (I think). I haven’t seen this exact approach elsewhere. It might be useful beyond self-improvement.


Code

Everything is open:

  • V1 validation (proving it failed): validation_experiment.py
  • V2 trainer (reproducibility + fidelity): sipit_v2_trainer.py
  • V2 validation: validate_v2.py
  • Full design doc: DESIGN_V2.md

[Coming Soon]


Final Thoughts

I was wrong in my original post. The model wasn’t self-improving–it was self-deceiving, and I was deceived along with it.

But being wrong led to being less wrong. The reproducibility insight is real. The V2 approach is validated as safe. The path forward is clearer.

Science is a process of being wrong in increasingly precise ways. This is me being more precise about my wrongness.


Thanks for reading. If you’re working on self-improvement or found this useful, I’d love to hear from you.