by Adam · 5 min read

Self-Improving AI: Using Inversion Fidelity as a Training Signal

A novel discovery: hidden state inversion quality predicts model capability, enabling self-improving systems without external feedback

ai-training self-improvement transformers fidelity novel-research

The Key Discovery

Finding: There’s a strong correlation (r = 0.891) between how well a model’s hidden states encode information and its task performance.

Implication: Models can identify their own knowledge gaps without any external feedback.

This enables a new class of self-improving AI systems that verify and enhance their own understanding.

What is Inversion Fidelity?

Fidelity measures how accurately we can recover input tokens from hidden states:

import torch

def compute_fidelity(model, tokens, k=5):
    """
    What fraction of tokens appear in the top-k predictions
    decoded from their own hidden states?

    tokens: 1-D tensor of token ids (e.g. from a HF tokenizer).
    """
    with torch.no_grad():
        # Final-layer hidden states from a standard forward pass
        outputs = model(tokens.unsqueeze(0), output_hidden_states=True)
        hidden_states = outputs.hidden_states[-1][0]   # (seq_len, hidden)
        # Decode each position's hidden state back to the vocabulary
        logits = model.lm_head(hidden_states)          # (seq_len, vocab)

    correct = 0
    for pos, token in enumerate(tokens):
        top_k = logits[pos].topk(k=k).indices
        if token in top_k:
            correct += 1

    return correct / len(tokens)
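
For example, a quick check against a Hugging Face checkpoint might look like this; the model name and command string are the ones used elsewhere in this post, and the helper above is assumed to be in scope:

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# 1-D tensor of token ids for the text we want to probe
tokens = tokenizer("nmap -sV -p 1-1000 target.com", return_tensors="pt").input_ids[0]
print(f"fidelity: {compute_fidelity(model, tokens):.2%}")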

  • High fidelity (~6%+): model strongly encodes the concept
  • Low fidelity (~4%): model weakly encodes the concept

The Correlation Experiment

I tested TinyLlama-1.1B on 12 Kali Linux security tools; seven of them are shown below:

Tool         Fidelity   Selection Accuracy   Parameter Accuracy
nmap         6.2%       86%                  67%
nikto        5.8%       100%                 80%
sqlmap       5.1%       40%                  20%
dirb         4.8%       20%                  13%
metasploit   4.2%       0%                   0%
hydra        4.1%       0%                   0%
john         4.0%       0%                   0%

Pearson correlation: r = 0.891 (p < 0.001)

Tools the model “understands” (high fidelity) are used correctly. Tools it doesn’t understand (low fidelity) fail completely.
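
For reference, the correlation itself is a one-liner with scipy.stats.pearsonr. The sketch below uses only the seven tools listed in the table, so it won't reproduce r = 0.891 exactly; the published value was computed over all 12 tools.

from scipy.stats import pearsonr

# (fidelity %, selection accuracy %) for the seven tools in the table above
fidelity  = [6.2, 5.8, 5.1, 4.8, 4.2, 4.1, 4.0]
selection = [86, 100, 40, 20, 0, 0, 0]

r, p = pearsonr(fidelity, selection)
print(f"r = {r:.3f} (p = {p:.4f})")   # full 12-tool experiment: r = 0.891, p < 0.001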

Why This Matters

Traditional Training Feedback

User Input → Model Output → External Verifier → Reward → Update Weights
                              ↑
                         Requires Oracle

External verification is expensive:

  • Human evaluation: ~$0.10-1.00 per sample
  • Model-as-judge: ~$0.01 per sample
  • Ground truth comparison: Requires labeled data

Fidelity-Based Feedback

User Input → Hidden States → Inversion Fidelity → Self-Reward → Update Weights
                                    ↑
                             No Oracle Needed

The model verifies its own understanding before generating output.

Fidelity-Guided Training Algorithm

import torch
import torch.nn.functional as F

class FidelityGuidedTrainer:
    def __init__(self, model, corpus):
        self.model = model
        self.corpus = corpus
        self.sipit = SipItInverter(model)    # fidelity/inversion helper

    def compute_fidelity_weights(self, batch):
        """Weight examples by how poorly the model understands them."""
        weights = []
        for example in batch:
            fidelity = self.sipit.compute_fidelity(example)
            # Low fidelity = high weight (focus on weak areas)
            weights.append(1.0 / (fidelity + 0.01))
        weights = torch.tensor(weights)
        return weights / weights.sum()        # normalize to sum to 1

    def train_step(self, batch):
        # Per-example language modeling loss; reduction="none" keeps one
        # loss per example so each can be reweighted below
        logits = self.model(batch.input_ids).logits
        per_token = F.cross_entropy(
            logits.transpose(1, 2), batch.labels, reduction="none"
        )
        lm_loss = per_token.mean(dim=1)       # (batch_size,)

        # Fidelity-weighted loss (weights sum to 1, so this is a weighted mean)
        weights = self.compute_fidelity_weights(batch).to(lm_loss.device)
        weighted_loss = (lm_loss * weights).sum()

        # Optional: fidelity as auxiliary objective (maximize average fidelity)
        fidelity_loss = -self.compute_batch_fidelity(batch)  # mean fidelity over batch

        total_loss = weighted_loss + 0.3 * fidelity_loss
        return total_loss
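
A minimal outer loop around this trainer might look like the sketch below. The optimizer, learning rate, and corpus.batches iterator are illustrative assumptions; only the gradient clipping value comes from the v2 config later in the post.

import torch
from torch.optim import AdamW

trainer = FidelityGuidedTrainer(model, corpus)
optimizer = AdamW(model.parameters(), lr=5e-5)

for batch in corpus.batches(batch_size=8):        # hypothetical batch iterator
    loss = trainer.train_step(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad()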

Experimental Results

Setup

  • Base Model: TinyLlama-1.1B-Chat
  • Training Data: 12 Kali Linux tools (nmap, nikto, sqlmap, etc.)
  • Method: QLoRA (rank 16, alpha 32)
  • Hardware: RTX 3090
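
For reference, one plausible way to wire up this setup with the transformers and peft libraries. The rank and alpha come from the post; the quantization details and target modules are assumptions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # the "Q" in QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=bnb_config,
)
lora_config = LoraConfig(
    r=16,                                    # rank 16
    lora_alpha=32,                           # alpha 32
    target_modules=["q_proj", "v_proj"],     # assumption: attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)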

Version 1: Aggressive Weighting

config = {
    "fidelity_weight": 3.0,  # Strong focus on weak areas
    "learning_rate": 2e-4,
    "epochs": 5
}

Results:

Metric               Baseline   After Training
Selection Accuracy   17.0%      21.4% (+4.4%)
Generation Quality   Good       Degraded

The model improved at tool selection but suffered catastrophic forgetting of generation abilities.

Version 2: Conservative Approach

config = {
    "fidelity_weight": 1.5,  # Gentler weighting
    "learning_rate": 5e-5,   # Lower LR
    "mixed_data": True,      # 80% tools, 20% general
    "gradient_clip": 1.0
}

Results:

Metric               Baseline   After Training
Selection Accuracy   17.0%      17.0% (stable)
Generation Quality   Good       Good
Fidelity (avg)       4.3%       4.5%

Generation quality was preserved, but selection accuracy did not improve. Balancing targeted improvement against preserving existing abilities is the core fine-tuning tradeoff.
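
A minimal sketch of what the mixed_data flag implies: interleave general-domain text with the tool corpus so tool-specific updates don't overwrite existing abilities. Only the 80/20 ratio comes from the config; the rest is illustrative.

import random

def mixed_batches(tool_corpus, general_corpus, batch_size=8, tool_ratio=0.8):
    """Yield batches that are ~80% tool examples and ~20% general text."""
    while True:   # endless stream; the training loop decides when to stop
        batch = []
        for _ in range(batch_size):
            source = tool_corpus if random.random() < tool_ratio else general_corpus
            batch.append(random.choice(source))
        yield batch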

The Self-Improvement Loop

The ultimate goal is a model that continuously improves itself:

┌─────────────────────────────────────────────────┐
│                                                 │
│    User Query                                   │
│        ↓                                        │
│    Compute Fidelity on Query Concepts           │
│        ↓                                        │
│    Low Fidelity?  ──Yes──→  Flag for Learning   │
│        │                         │              │
│        No                        ↓              │
│        ↓                    Generate Training   │
│    Generate Response        Data for Weak Areas │
│        │                         │              │
│        │                         ↓              │
│        │                    Update Weights      │
│        │                         │              │
│        └─────────────────────────┘              │
│                                                 │
└─────────────────────────────────────────────────┘

Key Properties:

  1. No external oracle required
  2. Identifies specific knowledge gaps
  3. Focuses learning on weak areas
  4. Self-limiting (high fidelity = stop training)
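
In code, the loop in the diagram reduces to a few lines of control flow. This is a sketch: extract_concepts, generate_training_data, and fine_tune are hypothetical stand-ins for components the post doesn't specify.

FIDELITY_THRESHOLD = 0.05   # ~5%, between the observed weak (~4%) and strong (~6%) bands

def handle_query(model, query):
    # Which concepts does this query depend on? (hypothetical helper)
    concepts = extract_concepts(query)
    weak = [c for c in concepts if compute_fidelity(model, c) < FIDELITY_THRESHOLD]

    if weak:
        # Flag for learning: synthesize data for the weak areas, update weights
        data = generate_training_data(weak)   # hypothetical helper
        fine_tune(model, data)                # hypothetical helper

    # Fidelity is now (or was already) high enough to respond
    return model.generate(query)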

Pre-Generation Verification

Another application is verifying capability before generating:

def verified_generate(model, prompt, required_concepts, threshold=0.05):
    """Only generate if the model understands every required concept."""

    # Check all concepts so the caller sees every gap, not just the first
    weak = {}
    for concept in required_concepts:
        fidelity = compute_fidelity(model, concept)  # tokenization of the concept elided
        if fidelity < threshold:
            weak[concept] = fidelity

    if weak:
        return {
            "status": "insufficient_knowledge",
            "weak_concepts": sorted(weak),
            "confidence": min(weak.values()),
        }

    # Model understands all concepts - safe to generate
    return {
        "status": "success",
        "response": model.generate(prompt)
    }

This enables:

  • Honest uncertainty: Model knows what it doesn’t know
  • Graceful degradation: Defer to human on weak areas
  • Targeted retrieval: Fetch documents for low-fidelity concepts
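
For example, the targeted-retrieval path might look like this, where retrieve_documents stands in for whatever retrieval backend is available:

prompt = "Scan target.com for SQL injection vulnerabilities"
result = verified_generate(model, prompt, required_concepts=["sqlmap", "sql injection"])

if result["status"] == "insufficient_knowledge":
    # Fetch reference material for exactly the concepts the model is weak on
    docs = retrieve_documents(result["weak_concepts"])   # hypothetical helper
    augmented = "\n".join(docs) + "\n\n" + prompt
    result = {"status": "success", "response": model.generate(augmented)}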

Scaling Considerations

Computational Cost

Operation                            Time (RTX 3090)
Fidelity per token                   ~0.02ms
Fidelity per sequence (100 tokens)   ~2ms
SipIt full inversion (100 tokens)    ~600s

Fidelity computation is cheap. Full inversion is expensive but rarely needed.
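
Fidelity is cheap because it reuses the forward pass you are already running; a vectorized version removes the per-token Python loop entirely. A sketch, assuming a Hugging Face causal LM:

import torch

@torch.no_grad()
def batch_fidelity(model, input_ids, k=5):
    """Fidelity for whole (batch, seq) tensors in a single forward pass."""
    hidden = model(input_ids, output_hidden_states=True).hidden_states[-1]
    logits = model.lm_head(hidden)                      # (batch, seq, vocab)
    top_k = logits.topk(k=k, dim=-1).indices            # (batch, seq, k)
    hits = (top_k == input_ids.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()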

Memory Requirements

No additional memory is needed beyond standard inference; fidelity reuses the hidden states the model already computes.

Connection to AI Safety

Fidelity-based self-improvement has safety implications:

Positive

  • Intrinsic honesty: Model knows its limitations
  • Bounded improvement: High fidelity naturally limits learning
  • Interpretable: Can inspect what model does/doesn’t understand

Concerns

  • Self-modification: Model updating its own weights
  • Goal drift: Optimizing fidelity might diverge from intended objectives
  • Deceptive alignment: Could a model fake low fidelity to trigger learning?

Open Research Questions

  1. Optimal weighting: How to balance fidelity loss vs. language modeling loss?
  2. Curriculum learning: Should we train on low-fidelity concepts first or last?
  3. Scaling laws: Does correlation hold for larger models?
  4. Multi-task fidelity: How to measure fidelity for complex, multi-concept tasks?
  5. Continuous learning: Can models train online without catastrophic forgetting?

Code Release

# Coming soon: pip install sipit

from sipit import FidelityTrainer, SipItInverter
from transformers import AutoModelForCausalLM

# Initialize
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
trainer = FidelityTrainer(model)

# Compute fidelity
fidelity = trainer.compute_fidelity("nmap -sV -p 1-1000 target.com")
print(f"Model understanding: {fidelity:.2%}")

# Train on weak concepts
trainer.train_on_weak_concepts(corpus, fidelity_threshold=0.05)

Conclusion

The core insight: Hidden state quality predicts capability. This enables:

  1. Pre-generation verification without external oracles
  2. Targeted training on specific knowledge gaps
  3. Self-improving systems that know what they don’t know

This is early research with open challenges (catastrophic forgetting, optimal training dynamics), but the correlation is real and the implications are significant.

Next steps:

  • Scale to larger models (Mistral-7B, Llama-70B)
  • Test on diverse domains (not just security tools)
  • Develop anti-forgetting techniques
  • Explore safety implications

Self-improving AI isn’t science fiction; it’s a research problem we can work on today.
