Retracted

This post contains conclusions or claims that I have since determined to be incorrect or insufficiently supported. I'm keeping it published for transparency, but please read the correction below.

Read why I was wrong
by Adam 9 min read

Self-Improving Models Without Labels: What I Just Proved and Why It Matters

A 7B model taught itself to generate better security commands using only its own understanding signals. No human labels, no external reward. Here's how and why it matters.

ai machine-learning self-improvement llm homelab security research fine-tuning

The Experiment That Kept Me Up Last Night

I ran an experiment that proved something I’ve been theorizing about for months: a 7B parameter model can teach itself to get better at a task using only its own internal “understanding” signals. No human labels. No external reward models. No bigger model acting as a judge.

The results after 10 cycles:

BEFORE (Cycle 1):
  Intrinsic Score:    0.533
  Acceptance Rate:    60.7%

AFTER (Cycle 10):
  Intrinsic Score:    0.627
  Acceptance Rate:    88.0%

IMPROVEMENT:
  Score:              +0.096 (+18.1%)
  Acceptance:         +27.3 percentage points
  Training Loss:      5.70 -> 4.73 (-17%)

This isn’t a toy benchmark. The model was generating real security commands – nmap scans, SQL injection tests, directory enumeration. Commands that either work or don’t when you run them against a live target.

The Plain-English Version

Here’s what I did:

  1. Asked the model to write commands for security tasks (e.g., “scan port 80 on 192.168.1.1”)
  2. Asked the same model to explain what each generated command would do
  3. If its explanation matched the original task, I treated that as a “good” example
  4. Used those good examples to retrain the model
  5. Repeated for 10 cycles

After a few rounds, the model got better at generating commands it actually understands – without me ever telling it which commands were “correct.”

The key insight: if a model can accurately reconstruct the original task from its own output, it probably generated something sensible. If it can’t explain what it just wrote, it probably hallucinated.

Why This Is Different From Everything Else

Let me be precise about what makes this novel. The AI research community has explored self-improvement before, but always with external scaffolding:

Method                   Human Labels        Ground Truth   External Judge
RLHF                     Required            -              Human preference
RLAIF                    Initial SFT data    -              Separate AI model
STaR (2022)              No                  Yes            -
SPIN (2024)              Yes (seed)          Yes            -
Self-Rewarding (2024)    No                  No             Yes (self-as-judge prompt)
Constitutional AI        No                  No             Yes (principles)
This experiment          No                  No             No

The difference: I’m not using a separate “judge” prompt or importing preferences from a larger model. The scoring signal comes entirely from inversion – asking the model to reconstruct the input from its own output. If reconstruction succeeds, the example is good. If it fails, the example is bad.

No external principles. No evaluation rubrics. Just: “Do you understand what you just generated?”

How It Actually Works

The Mechanism

FORWARD PASS:
  Task: "Scan port 80 on 192.168.1.1"
    | Model generates
    v
  Command: "nmap -p 80 192.168.1.1"

INVERSE PASS:
  Command: "nmap -p 80 192.168.1.1"
    | Model reconstructs
    v
  Task: "Scan port 80 on host 192.168.1.1"

INTRINSIC SCORE:
  similarity(original_task, reconstructed_task) = 0.95

  High score = Model understands what it generated
  Low score = Model doesn't understand its own output
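
In code, the two passes are just two prompts against the same model. Here's a minimal sketch, assuming a generate(prompt) helper that wraps the local model call; the prompt wording is illustrative, not the exact prompts from the run:

# Minimal sketch of the forward and inverse passes.
# generate(prompt) is an assumed helper that calls the local model.

def generate_command(task: str) -> str:
    """Forward pass: task description -> shell command."""
    prompt = (
        "You are a security assistant. Reply with a single shell command "
        f"that accomplishes this task:\n{task}\nCommand:"
    )
    return generate(prompt).strip()

def invert(command: str) -> str:
    """Inverse pass: shell command -> reconstructed task description."""
    prompt = (
        "Describe, in one sentence, the task this command performs:\n"
        f"{command}\nTask:"
    )
    return generate(prompt).strip()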

The Scoring Function

The similarity score combines three intrinsic signals:

  1. Token overlap (20%): Do key terms from the original task appear in the reconstruction?
  2. Sequence similarity (30%): Does the structure match (using difflib)?
  3. LLM self-rating (50%): Ask the model itself to rate semantic similarity 0-10

That third component is critical. The model judges whether two task descriptions mean the same thing. This is fully intrinsic – no external model, no human preference data.
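
Here's a minimal sketch of that weighted combination, keeping the same 20/30/50 split. The llm_rate_similarity() helper is an assumption standing in for the self-rating prompt (it should return a number from 0 to 10):

import difflib

def score(original_task: str, reconstructed_task: str) -> float:
    """Combine the three intrinsic signals into one similarity score."""
    orig_tokens = set(original_task.lower().split())
    recon_tokens = set(reconstructed_task.lower().split())

    # 1. Token overlap (20%): fraction of original-task terms that reappear
    token_overlap = len(orig_tokens & recon_tokens) / max(len(orig_tokens), 1)

    # 2. Sequence similarity (30%): structural match via difflib
    seq_sim = difflib.SequenceMatcher(
        None, original_task.lower(), reconstructed_task.lower()
    ).ratio()

    # 3. LLM self-rating (50%): assumed helper, 0-10 scale normalized to 0-1
    llm_sim = llm_rate_similarity(original_task, reconstructed_task) / 10.0

    return 0.2 * token_overlap + 0.3 * seq_sim + 0.5 * llm_sim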

The Training Loop

# Pseudocode for the full loop
for cycle in range(10):
    tasks = sample_random_tasks()  # "Scan X", "Find vulns in Y", etc.

    # Generate multiple candidates per task
    candidates = [
        (task, generate_command(task))
        for task in tasks
        for _ in range(3)  # 3 candidates each
    ]

    # Score via inversion
    scored = [
        (task, cmd, score(task, invert(cmd)))
        for (task, cmd) in candidates
    ]

    # Keep only the high-scoring (task, command) pairs
    accepted = [(task, cmd) for (task, cmd, s) in scored if s >= 0.5]

    # Fine-tune on accepted examples
    finetune(model, accepted)  # LoRA, ~4 minutes per cycle

Each cycle:

  • Generates 150 candidate commands (50 tasks x 3 candidates)
  • Scores them all via inversion
  • Keeps examples scoring above 0.5
  • Fine-tunes the model on the accepted set
  • Moves to the next cycle with an improved model

Cycle-by-Cycle Progress

Cycle   Before Score   After Score   Acceptance
1       0.531          0.547         67.3%
2       0.523          0.576         74.7%
3       0.552          0.559         72.0%
4       0.571          0.577         78.0%
5       0.587          0.594         84.0%
6       0.601          0.620         83.3%
7       0.621          0.640         87.3%
8       0.598          0.632         88.0%
9       0.620          0.655         89.3%
10      0.612          0.627         88.0%

Score improved in 9 out of 10 cycles. Acceptance rate climbed from 60% to 88%. No mode collapse. The model remained stable and continued improving.

The Hardware Reality

This ran on a single RTX 3090 in my homelab. Total runtime: about 3 hours for 10 cycles.

Model: mistralai/Mistral-7B-Instruct-v0.2
Quantization: 4-bit (bitsandbytes)
Fine-tuning: LoRA (r=16, alpha=32)
VRAM usage: ~18GB peak
Cycles: 10
Tasks per cycle: 50
Candidates per task: 3
Total examples generated: 3,000
Total examples accepted: 2,392
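
For reference, here's roughly what that setup looks like with transformers, bitsandbytes, and peft. Everything beyond what's listed above (4-bit quantization, LoRA r=16, alpha=32) is an assumption, including the target modules:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

# 4-bit quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapter matching the reported r=16, alpha=32
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed; the post doesn't list them
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)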

This isn’t a datacenter experiment. It’s something you can reproduce on consumer hardware overnight.

Why Inversion Works as a Training Signal

The intuition: a model’s ability to reconstruct the input from its output reflects how well it understood the relationship between them.

If I ask the model to generate a command for “scan the top 100 ports on 192.168.1.1” and it outputs nmap --top-ports 100 192.168.1.1, then asking it “what does this command do?” should yield something close to the original task.

If instead it outputs garbage like nmap -sV --script=auth 10.0.0.0/8 (a completely different scan), the reconstruction will fail. The model will describe what that command does, which won’t match “scan the top 100 ports on 192.168.1.1.”

High reconstruction similarity = the model generated something coherent. Low reconstruction similarity = the model hallucinated or misunderstood.

This is measurable, automatic, and requires no external labels.
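
Here's a tiny worked example of the mechanical part of that check, using difflib on the same scenario. The reconstruction strings are invented for illustration:

import difflib

task = "scan the top 100 ports on 192.168.1.1"
good_recon = "scan the 100 most common ports on 192.168.1.1"
bad_recon = "run service detection and auth scripts across the 10.0.0.0/8 network"

for recon in (good_recon, bad_recon):
    ratio = difflib.SequenceMatcher(None, task, recon).ratio()
    print(f"{ratio:.2f}  {recon}")

# The faithful reconstruction scores far higher than the mismatched one,
# before the LLM self-rating is even applied.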

The Good: Safer, More Reliable Agents

This technique has immediate applications for building better AI agents.

Pre-Action Verification

Before an agent executes a command, it can inversion-check its own output:

def safe_execute(agent, task):
    command = agent.generate(task)
    reconstructed = agent.invert(command)
    similarity = score(task, reconstructed)

    if similarity < THRESHOLD:
        # Agent doesn't understand its own output
        return clarify_with_user(task)
    else:
        return execute(command)

Low inversion score = the agent should ask for clarification rather than blindly executing something it doesn’t understand.

Edge Deployment Without Labels

Small models running on edge devices (security appliances, network gear, IoT) can self-verify domain-specific tasks without expensive cloud-based labeling pipelines. A homelab running a local LLM for DevOps automation can improve itself overnight using only its own feedback.

Interpretability Signal

The inversion score provides a pre-action confidence signal tied to the model’s internal representations. Unlike token-level probabilities (which just measure “how likely was this token?”), inversion scores measure “does the model understand the semantic relationship between input and output?”

This is closer to what we actually care about for agent reliability.
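
For contrast, here's roughly what the token-level signal looks like: the mean log-probability the model assigns to its own completion tokens. This is a sketch with transformers, not code from the experiment, and the prompt/completion boundary is approximate:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def mean_token_logprob(prompt: str, completion: str) -> float:
    """Average log-probability of the completion tokens given the prompt.
    Measures fluency ("how likely was this token?"), not understanding."""
    full = tokenizer(prompt + completion, return_tensors="pt").to(model.device)
    # Approximate boundary; BPE merges at the seam can shift it slightly
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(**full).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full["input_ids"][:, 1:]
    per_token = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return per_token[:, prompt_len - 1:].mean().item()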

The Bad: Dual-Use Concerns

I’m not going to pretend this research is purely benign. The same loop that improves a defensive security assistant can improve an offensive one.

Offensive Automation

The experiment domain was security command generation. That means the model got better at writing nmap scans, SQL injection tests, and brute-force attacks. A malicious actor could use this exact technique to refine exploit generation without any external review.

Quiet Bootstrapping

Because the model is its own judge, it can iterate in isolation. No API calls to a larger model. No human reviewers in the loop. Just a GPU running overnight, getting better at whatever you pointed it at.

Mitigations

This isn’t unsolvable, but it requires deliberate guardrails:

  1. Domain gating: High-risk domains (malware generation, exploit development) should require human approval for any training run, not just deployment.
  2. Inversion access controls: Rate-limit or audit the inversion pathway. If something is rapidly self-improving on security tasks, flag it.
  3. Buffer auditing: Log everything that gets added to the self-training buffer. Review what the model is teaching itself.
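
Here's a minimal version of that buffer audit, wired to the (task, command, score) triples from the training loop above. The JSONL path and field names are illustrative:

import json
import time

AUDIT_PATH = "self_training_audit.jsonl"  # illustrative path

def audit_buffer(cycle: int, scored_examples, threshold: float = 0.5) -> None:
    """Append every example that clears the acceptance threshold to an
    append-only JSONL log, so a human can review what the model is
    teaching itself."""
    with open(AUDIT_PATH, "a") as f:
        for task, command, similarity in scored_examples:
            if similarity >= threshold:
                f.write(json.dumps({
                    "timestamp": time.time(),
                    "cycle": cycle,
                    "task": task,
                    "command": command,
                    "similarity": round(similarity, 3),
                }) + "\n")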

The difference between legitimate security research and malicious automation often comes down to intent and governance. This technique makes governance more important, not less.

Why This Isn’t “Just Fine-Tuning”

Standard fine-tuning looks like this:

Human creates dataset -> Model trains on dataset -> Model gets better

This looks like:

Model generates data -> Model scores data -> Model trains on self-scored data -> Repeat

Three critical differences:

  1. Self-generated data: Training examples come from the model itself, not a human curator. Each cycle produces new examples based on the model’s current capabilities.

  2. Self-scored labels: The “good” vs “bad” distinction comes from inversion, not human annotation. The model decides what to learn from.

  3. Closed feedback loop: Everything happens inside one model. No reference to a larger “teacher” model. No import of external preferences.

This is closer to how humans learn through self-reflection than traditional supervised learning.

Configuration for Reproducibility

If you want to run this yourself:

config = SelfImprovementConfig(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    num_cycles=10,
    tasks_per_cycle=50,
    num_candidates_per_task=3,
    similarity_threshold=0.5,
    generation_temperature=0.7,
    learning_rate=2e-4,
    batch_size=4,
)

Key hyperparameters:

  • similarity_threshold=0.5: Accept examples with reconstruction similarity above 50%. Too high = not enough training data. Too low = training on garbage.
  • generation_temperature=0.7: High enough for diversity, low enough for coherence. The model generates 3 candidates per task; you want variation.
  • num_candidates_per_task=3: More candidates = more chances to find a good example. Diminishing returns above 5.

What This Means for AI Safety

The alignment community has focused heavily on external feedback mechanisms: RLHF, constitutional AI, debate between models. These all assume the training signal must come from outside the model being trained.

This experiment suggests an alternative: intrinsic feedback through inversion. A model can improve by asking whether it understands its own outputs.

That’s both promising and concerning:

Promising: We can build self-verifying agents that check their own understanding before acting. Reliability without expensive labeling.

Concerning: We can’t assume a model needs external access to improve. A model with only local compute can bootstrap itself in restricted domains.

The difference between these outcomes is governance. Technical capability is neutral; application is not.

Where to Go From Here

This was a proof-of-concept on a narrow domain (security commands) with a 7B model. Open questions:

  1. Does it scale? Would a 70B model show similar improvement curves? Faster? Slower?
  2. Does it generalize? Can inversion-based training improve coding, writing, reasoning?
  3. Where’s the ceiling? After how many cycles does improvement plateau?
  4. Can it collapse? Under what conditions does the loop degrade instead of improve?

I’ll be running extended experiments over the coming weeks. The infrastructure is set up; now it’s a matter of cycles and analysis.

The Bottom Line

A small model running on consumer hardware improved its task execution by 18% using only its own understanding as feedback. No human labels. No external judge. Pure intrinsic self-improvement.

If one small model can get better at what it understands – without labels or an external judge – imagine what disciplined teams (good or bad) can do. The difference will come down to governance, guardrails, and intent.

The code that does this isn’t complicated. The implications are.


Experiment completed: 2025-12-16
Runtime: ~3 hours on RTX 3090
Examples generated: 3,000
Final improvement: +18.1%