Self-Improving AI: Using Inversion Fidelity as a Training Signal
A novel discovery: hidden state inversion quality predicts model capability, enabling self-improving systems without external feedback
The Key Discovery
Finding: There’s a strong correlation (r = 0.891) between how well a model’s hidden states encode information and its task performance.
Implication: Models can identify their own knowledge gaps without any external feedback.
This enables a new class of self-improving AI systems that verify and enhance their own understanding.
What is Inversion Fidelity?
Fidelity measures how accurately we can recover input tokens from hidden states:
def compute_fidelity(model, tokens):
    """
    What fraction of tokens appear in the top-k predictions
    decoded from their own hidden states?
    """
    hidden_states = model.get_hidden_states(tokens)  # [seq_len, hidden_dim]
    logits = model.lm_head(hidden_states)            # [seq_len, vocab_size]
    correct = 0
    for pos, token in enumerate(tokens):
        top_k = logits[pos].topk(k=5).indices
        if token in top_k:
            correct += 1
    return correct / len(tokens)
- High fidelity (~6%+): the model strongly encodes the concept
- Low fidelity (~4%): the model weakly encodes the concept
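For concreteness, here is one way the same measurement can be run end to end with Hugging Face transformers. The example command and model name come from later sections; using the model's output logits in place of an explicit get_hidden_states/lm_head call is an equivalence assumption, since the logits are exactly the LM-head projection of the final hidden states.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

ids = tokenizer("nmap -sV -p 1-1000 target.com", return_tensors="pt").input_ids

with torch.no_grad():
    # .logits is the lm_head projection of the final hidden states,
    # matching compute_fidelity's hidden_state -> lm_head path
    logits = model(ids).logits[0]                      # [seq_len, vocab]

top_k = logits.topk(k=5, dim=-1).indices               # [seq_len, 5]
hits = (top_k == ids[0].unsqueeze(-1)).any(dim=-1)     # is each token recovered at its own position?
print(f"Fidelity: {hits.float().mean().item():.1%}")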
The Correlation Experiment
I tested TinyLlama-1.1B on 12 Kali Linux security tools (a representative subset is shown below):
| Tool | Fidelity | Selection Accuracy | Parameter Accuracy |
|---|---|---|---|
| nmap | 6.2% | 86% | 67% |
| nikto | 5.8% | 100% | 80% |
| sqlmap | 5.1% | 40% | 20% |
| dirb | 4.8% | 20% | 13% |
| metasploit | 4.2% | 0% | 0% |
| hydra | 4.1% | 0% | 0% |
| john | 4.0% | 0% | 0% |
Pearson correlation: r = 0.891 (p < 0.001)
Tools the model “understands” (high fidelity) are used correctly. Tools it doesn’t understand (low fidelity) fail completely.
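The reported correlation can be spot-checked from the table (fidelity vs. selection accuracy). This minimal reproduction uses only the seven tools shown, so the exact value will differ somewhat from the full 12-tool figure of r = 0.891.

from scipy.stats import pearsonr

# Fidelity (%) and selection accuracy (%) for the tools listed in the table above
fidelity  = [6.2, 5.8, 5.1, 4.8, 4.2, 4.1, 4.0]
selection = [86, 100, 40, 20, 0, 0, 0]

r, p = pearsonr(fidelity, selection)
print(f"r = {r:.3f}, p = {p:.4f}")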
Why This Matters
Traditional Training Feedback
User Input → Model Output → External Verifier → Reward → Update Weights
                                    ↑
                             Requires Oracle
External verification is expensive:
- Human evaluation: ~$0.10-1.00 per sample
- Model-as-judge: ~$0.01 per sample
- Ground truth comparison: Requires labeled data
Fidelity-Based Feedback
User Input → Hidden States → Inversion Fidelity → Self-Reward → Update Weights
                                     ↑
                              No Oracle Needed
The model verifies its own understanding before generating output.
Fidelity-Guided Training Algorithm
import torch
import torch.nn.functional as F

class FidelityGuidedTrainer:
    def __init__(self, model, corpus):
        self.model = model
        self.corpus = corpus
        self.sipit = SipItInverter(model)

    def compute_fidelity_weights(self, batch):
        """Weight examples by how poorly the model understands them."""
        weights = []
        for example in batch:
            fidelity = self.sipit.compute_fidelity(example)
            # Low fidelity = high weight (focus on weak areas)
            weights.append(1.0 / (fidelity + 0.01))
        weights = torch.tensor(weights)
        return weights / weights.sum()

    def train_step(self, batch):
        # Per-example language modeling loss (reduction="none" so each
        # example can be reweighted individually)
        logits = self.model(batch.input_ids).logits
        token_loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            batch.labels.view(-1),
            reduction="none",
        ).view(batch.labels.shape)
        lm_loss = token_loss.mean(dim=-1)              # one loss per example

        # Fidelity-weighted loss
        weights = self.compute_fidelity_weights(batch).to(lm_loss.device)
        weighted_loss = (lm_loss * weights).sum()

        # Optional: fidelity as auxiliary objective
        fidelity_loss = -self.compute_batch_fidelity(batch)
        total_loss = weighted_loss + 0.3 * fidelity_loss
        return total_loss
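A minimal sketch of how this train_step might be driven, assuming a standard PyTorch optimizer and a dataloader whose batches expose input_ids and labels. The learning rate and clipping value mirror the conservative configuration used later; they are illustrative, not prescriptive.

from torch.optim import AdamW
from torch.nn.utils import clip_grad_norm_

trainer = FidelityGuidedTrainer(model, corpus)
optimizer = AdamW(model.parameters(), lr=5e-5)

for batch in dataloader:   # assumed: yields batches with .input_ids and .labels
    optimizer.zero_grad()
    loss = trainer.train_step(batch)
    loss.backward()
    # Clip gradients, mirroring the gradient_clip=1.0 setting in Version 2 below
    clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()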
Experimental Results
Setup
- Base Model: TinyLlama-1.1B-Chat
- Training Data: 12 Kali Linux tools (nmap, nikto, sqlmap, etc.)
- Method: QLoRA (rank 16, alpha 32)
- Hardware: RTX 3090
Version 1: Aggressive Weighting
config = {
    "fidelity_weight": 3.0,   # Strong focus on weak areas
    "learning_rate": 2e-4,
    "epochs": 5,
}
Results:
| Metric | Baseline | After Training |
|---|---|---|
| Selection Accuracy | 17.0% | 21.4% (+4.4%) |
| Generation Quality | Good | Degraded |
The model improved at tool selection but suffered catastrophic forgetting of generation abilities.
Version 2: Conservative Approach
config = {
    "fidelity_weight": 1.5,   # Gentler weighting
    "learning_rate": 5e-5,    # Lower LR
    "mixed_data": True,       # 80% tools, 20% general
    "gradient_clip": 1.0,
}
Results:
| Metric | Baseline | After Training |
|---|---|---|
| Selection Accuracy | 17.0% | 17.0% (stable) |
| Generation Quality | Good | Good |
| Fidelity (avg) | 4.3% | 4.5% |
Generation quality was preserved, but selection accuracy did not improve. The tradeoff between learning new tool knowledge and protecting general ability remains the hard part of this fine-tuning setup.
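For reference, here is one way the mixed_data setting above could be realized. The 80/20 interleaving and the dataset variable names are assumptions about the setup, not a published recipe.

import random

def sample_mixed_batch(tool_examples, general_examples, batch_size=8, tool_frac=0.8):
    """Build a batch that is ~80% security-tool data and ~20% general text,
    to reduce catastrophic forgetting during fidelity-weighted fine-tuning."""
    n_tool = int(batch_size * tool_frac)
    batch = random.sample(tool_examples, n_tool)
    batch += random.sample(general_examples, batch_size - n_tool)
    random.shuffle(batch)
    return batch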
The Self-Improvement Loop
The ultimate goal: a model that continuously improves itself:
User Query
    ↓
Compute Fidelity on Query Concepts
    ↓
Low Fidelity? ──Yes──→ Flag for Learning
    │                       │
    No                      ↓
    ↓                  Generate Training
Generate Response      Data for Weak Areas
    │                       │
    │                       ↓
    │                  Update Weights
    │                       │
    └───────────────────────┘
Key Properties:
- No external oracle required
- Identifies specific knowledge gaps
- Focuses learning on weak areas
- Self-limiting (high fidelity = stop training)
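As a rough sketch of how this loop might be wired together: extract_concepts, generate_training_data, and fine_tune_on are hypothetical placeholders mirroring the boxes in the diagram, and the 5% cut-off matches the fidelity_threshold used in the code-release example further down.

FIDELITY_THRESHOLD = 0.05   # assumed ~5% cut-off between "understood" and "weak"

def handle_query(model, query):
    # Hypothetical helper: identify the concepts/tools the query depends on
    concepts = extract_concepts(query)
    weak = [c for c in concepts
            if compute_fidelity(model, c) < FIDELITY_THRESHOLD]

    if weak:
        # Flag for learning: build training data for the gaps, then update weights
        training_data = generate_training_data(weak)   # hypothetical
        fine_tune_on(model, training_data)             # hypothetical
        return {"status": "learning", "weak_concepts": weak}

    # Every concept is well encoded: answer directly
    return {"status": "ok", "response": model.generate(query)}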
Pre-Generation Verification
Another application: verify capability before generating:
THRESHOLD = 0.05   # assumed ~5% fidelity cut-off, consistent with the training examples below

def verified_generate(model, prompt, required_concepts):
    """Only generate if the model understands all required concepts."""
    weak = {}
    for concept in required_concepts:
        fidelity = compute_fidelity(model, concept)
        if fidelity < THRESHOLD:
            weak[concept] = fidelity

    if weak:
        return {
            "status": "insufficient_knowledge",
            "weak_concepts": list(weak),
            "confidence": min(weak.values()),
        }

    # Model understands all concepts - safe to generate
    return {
        "status": "success",
        "response": model.generate(prompt),
    }
This enables:
- Honest uncertainty: Model knows what it doesn’t know
- Graceful degradation: Defer to human on weak areas
- Targeted retrieval: Fetch documents for low-fidelity concepts
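For instance, the targeted-retrieval fallback could look roughly like this; retrieve_docs is a placeholder for whatever retrieval backend is available, not part of any released API.

def generate_with_retrieval(model, prompt, required_concepts):
    """Fall back to retrieval for concepts the model encodes weakly."""
    result = verified_generate(model, prompt, required_concepts)
    if result["status"] == "success":
        return result

    # Fetch documentation only for the low-fidelity concepts, then answer
    # with that context prepended to the prompt
    docs = retrieve_docs(result["weak_concepts"])   # hypothetical retriever
    augmented_prompt = "\n".join(docs) + "\n\n" + prompt
    return {"status": "retrieval_augmented",
            "response": model.generate(augmented_prompt)}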
Scaling Considerations
Computational Cost
| Operation | Time (RTX 3090) |
|---|---|
| Fidelity per token | ~0.02ms |
| Fidelity per sequence (100 tokens) | ~2ms |
| SipIt full inversion (100 tokens) | ~600s |
Fidelity computation is cheap. Full inversion is expensive but rarely needed.
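For a rough sense of scale: at ~2ms per 100-token sequence, scoring a 10,000-example corpus costs about 20 seconds of GPU time, versus roughly $100 for model-as-judge or $1,000-10,000 for human evaluation at the per-sample costs quoted earlier.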
Memory Requirements
No additional memory beyond standard inference—fidelity uses existing hidden states.
Connection to AI Safety
Fidelity-based self-improvement has safety implications:
Positive
- Intrinsic honesty: Model knows its limitations
- Bounded improvement: High fidelity naturally limits learning
- Interpretable: Can inspect what model does/doesn’t understand
Concerns
- Self-modification: Model updating its own weights
- Goal drift: Optimizing fidelity might diverge from intended objectives
- Deceptive alignment: Could a model fake low fidelity to trigger learning?
Open Research Questions
- Optimal weighting: How to balance fidelity loss vs. language modeling loss?
- Curriculum learning: Should we train on low-fidelity concepts first or last?
- Scaling laws: Does correlation hold for larger models?
- Multi-task fidelity: How to measure fidelity for complex, multi-concept tasks?
- Continuous learning: Can models train online without catastrophic forgetting?
Code Release
# Coming soon: pip install sipit
from transformers import AutoModelForCausalLM
from sipit import FidelityTrainer, SipItInverter

# Initialize
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
trainer = FidelityTrainer(model)

# Compute fidelity
fidelity = trainer.compute_fidelity("nmap -sV -p 1-1000 target.com")
print(f"Model understanding: {fidelity:.2%}")

# Train on weak concepts
trainer.train_on_weak_concepts(corpus, fidelity_threshold=0.05)
Conclusion
The core insight: Hidden state quality predicts capability. This enables:
- Pre-generation verification without external oracles
- Targeted training on specific knowledge gaps
- Self-improving systems that know what they don’t know
This is early research with open challenges (catastrophic forgetting, optimal training dynamics), but the correlation is real and the implications are significant.
Next steps:
- Scale to larger models (Mistral-7B, Llama-70B)
- Test on diverse domains (not just security tools)
- Develop anti-forgetting techniques
- Explore safety implications
Self-improving AI isn’t science fiction—it’s a research problem we can work on today.