Self-Improving AI: Using Inversion Fidelity as a Training Signal
A novel discovery: hidden state inversion quality predicts model capability, enabling self-improving systems without external feedback
The Key Discovery
Finding: There’s a strong correlation (r = 0.891) between how well a model’s hidden states encode information and its task performance.
Implication: Models can identify their own knowledge gaps without any external feedback.
This enables a new class of self-improving AI systems that verify and enhance their own understanding.
What is Inversion Fidelity?
Fidelity measures how accurately we can recover input tokens from hidden states:
def compute_fidelity(model, tokens):
    """
    What fraction of tokens appear in the top-k predictions
    decoded from their own hidden states?
    """
    hidden_states = model.get_hidden_states(tokens)  # [seq_len, hidden_dim]
    logits = model.lm_head(hidden_states)            # [seq_len, vocab_size]
    correct = 0
    for pos, token in enumerate(tokens):
        top_k = logits[pos].topk(k=5).indices
        if token in top_k:
            correct += 1
    return correct / len(tokens)
- High fidelity (~6%+): the model strongly encodes the concept
- Low fidelity (~4%): the model weakly encodes the concept
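For concreteness, here is one way the same measurement can be run end to end with Hugging Face transformers. The example command and model name come from later sections; using the model's output logits in place of an explicit get_hidden_states/lm_head call is an equivalence assumption, since the logits are exactly the LM-head projection of the final hidden states.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

ids = tokenizer("nmap -sV -p 1-1000 target.com", return_tensors="pt").input_ids

with torch.no_grad():
    # .logits is the lm_head projection of the final hidden states,
    # matching compute_fidelity's hidden_state -> lm_head path
    logits = model(ids).logits[0]                      # [seq_len, vocab]

top_k = logits.topk(k=5, dim=-1).indices               # [seq_len, 5]
hits = (top_k == ids[0].unsqueeze(-1)).any(dim=-1)     # is each token recovered at its own position?
print(f"Fidelity: {hits.float().mean().item():.1%}")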
The Correlation Experiment
I tested TinyLlama-1.1B on 12 Kali Linux security tools (a representative subset is shown below):
| Tool | Fidelity | Selection Accuracy | Parameter Accuracy |
|---|---|---|---|
| nmap | 6.2% | 86% | 67% |
| nikto | 5.8% | 100% | 80% |
| sqlmap | 5.1% | 40% | 20% |
| dirb | 4.8% | 20% | 13% |
| metasploit | 4.2% | 0% | 0% |
| hydra | 4.1% | 0% | 0% |
| john | 4.0% | 0% | 0% |
Pearson correlation: r = 0.891 (p < 0.001)
Tools the model “understands” (high fidelity) are used correctly. Tools it doesn’t understand (low fidelity) fail completely.
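The reported correlation can be spot-checked from the table (fidelity vs. selection accuracy). This minimal reproduction uses only the seven tools shown, so the exact value will differ somewhat from the full 12-tool figure of r = 0.891.

from scipy.stats import pearsonr

# Fidelity (%) and selection accuracy (%) for the tools listed in the table above
fidelity  = [6.2, 5.8, 5.1, 4.8, 4.2, 4.1, 4.0]
selection = [86, 100, 40, 20, 0, 0, 0]

r, p = pearsonr(fidelity, selection)
print(f"r = {r:.3f}, p = {p:.4f}")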
Why This Matters
Traditional Training Feedback
User Input → Model Output → External Verifier → Reward → Update Weights
                                    ↑
                             Requires Oracle
External verification is expensive:
- Human evaluation: ~$0.10-1.00 per sample
- Model-as-judge: ~$0.01 per sample
- Ground truth comparison: Requires labeled data
Fidelity-Based Feedback
User Input → Hidden States → Inversion Fidelity → Self-Reward → Update Weights
                                     ↑
                              No Oracle Needed
The model verifies its own understanding before generating output.
Fidelity-Guided Training Algorithm
import torch
import torch.nn.functional as F

class FidelityGuidedTrainer:
    def __init__(self, model, corpus):
        self.model = model
        self.corpus = corpus
        self.sipit = SipItInverter(model)

    def compute_fidelity_weights(self, batch):
        """Weight examples by how poorly the model understands them."""
        weights = []
        for example in batch:
            fidelity = self.sipit.compute_fidelity(example)
            # Low fidelity = high weight (focus on weak areas)
            weights.append(1.0 / (fidelity + 0.01))
        weights = torch.tensor(weights)
        return weights / weights.sum()

    def train_step(self, batch):
        # Per-example language modeling loss (reduction="none" so each
        # example can be reweighted individually)
        logits = self.model(batch.input_ids).logits
        token_loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            batch.labels.view(-1),
            reduction="none",
        ).view(batch.labels.shape)
        lm_loss = token_loss.mean(dim=-1)              # one loss per example

        # Fidelity-weighted loss
        weights = self.compute_fidelity_weights(batch).to(lm_loss.device)
        weighted_loss = (lm_loss * weights).sum()

        # Optional: fidelity as auxiliary objective
        fidelity_loss = -self.compute_batch_fidelity(batch)
        total_loss = weighted_loss + 0.3 * fidelity_loss
        return total_loss
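A minimal sketch of how this train_step might be driven, assuming a standard PyTorch optimizer and a dataloader whose batches expose input_ids and labels. The learning rate and clipping value mirror the conservative configuration used later; they are illustrative, not prescriptive.

from torch.optim import AdamW
from torch.nn.utils import clip_grad_norm_

trainer = FidelityGuidedTrainer(model, corpus)
optimizer = AdamW(model.parameters(), lr=5e-5)

for batch in dataloader:   # assumed: yields batches with .input_ids and .labels
    optimizer.zero_grad()
    loss = trainer.train_step(batch)
    loss.backward()
    # Clip gradients, mirroring the gradient_clip=1.0 setting in Version 2 below
    clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()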
Experimental Results
Setup
- Base Model: TinyLlama-1.1B-Chat
- Training Data: 12 Kali Linux tools (nmap, nikto, sqlmap, etc.)
- Method: QLoRA (rank 16, alpha 32)
- Hardware: RTX 3090
Version 1: Aggressive Weighting
config = {
    "fidelity_weight": 3.0,   # Strong focus on weak areas
    "learning_rate": 2e-4,
    "epochs": 5,
}
Results:
| Metric | Baseline | After Training |
|---|---|---|
| Selection Accuracy | 17.0% | 21.4% (+4.4%) |
| Generation Quality | Good | Degraded |
The model improved at tool selection but suffered catastrophic forgetting of generation abilities.
Version 2: Conservative Approach
config = {
    "fidelity_weight": 1.5,   # Gentler weighting
    "learning_rate": 5e-5,    # Lower LR
    "mixed_data": True,       # 80% tools, 20% general
    "gradient_clip": 1.0,
}
Results:
| Metric | Baseline | After Training |
|---|---|---|
| Selection Accuracy | 17.0% | 17.0% (stable) |
| Generation Quality | Good | Good |
| Fidelity (avg) | 4.3% | 4.5% |
Generation quality was preserved, but selection accuracy did not improve. The tradeoff between learning new tool knowledge and protecting general ability remains the hard part of this fine-tuning setup.
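For reference, here is one way the mixed_data setting above could be realized. The 80/20 interleaving and the dataset variable names are assumptions about the setup, not a published recipe.

import random

def sample_mixed_batch(tool_examples, general_examples, batch_size=8, tool_frac=0.8):
    """Build a batch that is ~80% security-tool data and ~20% general text,
    to reduce catastrophic forgetting during fidelity-weighted fine-tuning."""
    n_tool = int(batch_size * tool_frac)
    batch = random.sample(tool_examples, n_tool)
    batch += random.sample(general_examples, batch_size - n_tool)
    random.shuffle(batch)
    return batch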
The Self-Improvement Loop
The ultimate goal: a model that continuously improves itself:
User Query
    ↓
Compute Fidelity on Query Concepts
    ↓
Low Fidelity? ──Yes──→ Flag for Learning
    │                       │
    No                      ↓
    ↓                  Generate Training
Generate Response      Data for Weak Areas
    │                       │
    │                       ↓
    │                  Update Weights
    │                       │
    └───────────────────────┘
Key Properties:
- No external oracle required
- Identifies specific knowledge gaps
- Focuses learning on weak areas
- Self-limiting (high fidelity = stop training)
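As a rough sketch of how this loop might be wired together: extract_concepts, generate_training_data, and fine_tune_on are hypothetical placeholders mirroring the boxes in the diagram, and the 5% cut-off matches the fidelity_threshold used in the code-release example further down.

FIDELITY_THRESHOLD = 0.05   # assumed ~5% cut-off between "understood" and "weak"

def handle_query(model, query):
    # Hypothetical helper: identify the concepts/tools the query depends on
    concepts = extract_concepts(query)
    weak = [c for c in concepts
            if compute_fidelity(model, c) < FIDELITY_THRESHOLD]

    if weak:
        # Flag for learning: build training data for the gaps, then update weights
        training_data = generate_training_data(weak)   # hypothetical
        fine_tune_on(model, training_data)             # hypothetical
        return {"status": "learning", "weak_concepts": weak}

    # Every concept is well encoded: answer directly
    return {"status": "ok", "response": model.generate(query)}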
Pre-Generation Verification
Another application: verify capability before generating:
THRESHOLD = 0.05   # assumed ~5% fidelity cut-off, consistent with the training examples below

def verified_generate(model, prompt, required_concepts):
    """Only generate if the model understands all required concepts."""
    weak = {}
    for concept in required_concepts:
        fidelity = compute_fidelity(model, concept)
        if fidelity < THRESHOLD:
            weak[concept] = fidelity

    if weak:
        return {
            "status": "insufficient_knowledge",
            "weak_concepts": list(weak),
            "confidence": min(weak.values()),
        }

    # Model understands all concepts - safe to generate
    return {
        "status": "success",
        "response": model.generate(prompt),
    }
This enables:
- Honest uncertainty: Model knows what it doesn’t know
- Graceful degradation: Defer to human on weak areas
- Targeted retrieval: Fetch documents for low-fidelity concepts
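For instance, the targeted-retrieval fallback could look roughly like this; retrieve_docs is a placeholder for whatever retrieval backend is available, not part of any released API.

def generate_with_retrieval(model, prompt, required_concepts):
    """Fall back to retrieval for concepts the model encodes weakly."""
    result = verified_generate(model, prompt, required_concepts)
    if result["status"] == "success":
        return result

    # Fetch documentation only for the low-fidelity concepts, then answer
    # with that context prepended to the prompt
    docs = retrieve_docs(result["weak_concepts"])   # hypothetical retriever
    augmented_prompt = "\n".join(docs) + "\n\n" + prompt
    return {"status": "retrieval_augmented",
            "response": model.generate(augmented_prompt)}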
Scaling Considerations
Computational Cost
| Operation | Time (RTX 3090) |
|---|---|
| Fidelity per token | ~0.02ms |
| Fidelity per sequence (100 tokens) | ~2ms |
| SipIt full inversion (100 tokens) | ~600s |
Fidelity computation is cheap. Full inversion is expensive but rarely needed.
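For a rough sense of scale: at ~2ms per 100-token sequence, scoring a 10,000-example corpus costs about 20 seconds of GPU time, versus roughly $100 for model-as-judge or $1,000-10,000 for human evaluation at the per-sample costs quoted earlier.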
Memory Requirements
No additional memory beyond standard inference—fidelity uses existing hidden states.
Connection to AI Safety
Fidelity-based self-improvement has safety implications:
Positive
- Intrinsic honesty: Model knows its limitations
- Bounded improvement: High fidelity naturally limits learning
- Interpretable: Can inspect what model does/doesn’t understand
Concerns
- Self-modification: Model updating its own weights
- Goal drift: Optimizing fidelity might diverge from intended objectives
- Deceptive alignment: Could a model fake low fidelity to trigger learning?
Open Research Questions
- Optimal weighting: How to balance fidelity loss vs. language modeling loss?
- Curriculum learning: Should we train on low-fidelity concepts first or last?
- Scaling laws: Does correlation hold for larger models?
- Multi-task fidelity: How to measure fidelity for complex, multi-concept tasks?
- Continuous learning: Can models train online without catastrophic forgetting?
Code Release
# Coming soon: pip install sipit
from transformers import AutoModelForCausalLM
from sipit import FidelityTrainer, SipItInverter

# Initialize
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
trainer = FidelityTrainer(model)

# Compute fidelity
fidelity = trainer.compute_fidelity("nmap -sV -p 1-1000 target.com")
print(f"Model understanding: {fidelity:.2%}")

# Train on weak concepts
trainer.train_on_weak_concepts(corpus, fidelity_threshold=0.05)
Conclusion
The core insight: Hidden state quality predicts capability. This enables:
- Pre-generation verification without external oracles
- Targeted training on specific knowledge gaps
- Self-improving systems that know what they don’t know
This is early research with open challenges (catastrophic forgetting, optimal training dynamics), but the correlation is real and the implications are significant.
Next steps:
- Scale to larger models (Mistral-7B, Llama-70B)
- Test on diverse domains (not just security tools)
- Develop anti-forgetting techniques
- Explore safety implications
Self-improving AI isn’t science fiction—it’s a research problem we can work on today.