December 16, 2025 by Adam 4 min read

SipIt: Extracting System Prompts from Transformer Hidden States

How I achieved 100% token recovery from Mistral-7B hidden states and what it means for AI security

transformers security interpretability sipit mojo ai-safety

The Shocking Result

I can recover every single token of a system prompt from a transformer’s hidden states. Not 90%. Not “mostly”. 100% accuracy on Mistral-7B.

This validates a theoretical result from arXiv:2510.15511 with practical, reproducible code running on consumer hardware.

What is Transformer Injectivity?

Transformers are injective functions: each unique input sequence produces a unique hidden state representation. Mathematically, if:

f(input_A) = hidden_state
f(input_B) = hidden_state

Then input_A == input_B.

This means hidden states don’t just encode inputs – they uniquely identify them. If you can access a model’s hidden states, you can theoretically recover exactly what went in.

The SipIt Algorithm

Sequential Inverse Prompt via Iterative updates (SipIt) exploits injectivity through greedy search:

def sipit_recover(model, target_hidden_states):
    recovered = [BOS_TOKEN]

    for position in range(1, len(target_hidden_states)):
        best_token = None
        best_distance = float('inf')

        # Try every token in vocabulary
        for candidate in range(vocab_size):
            test_sequence = recovered + [candidate]
            computed_hidden = model.forward(test_sequence)

            # L2 distance to target hidden state
            distance = l2_norm(computed_hidden[position] - target_hidden_states[position])

            if distance < best_distance:
                best_distance = distance
                best_token = candidate

        recovered.append(best_token)

    return recovered

Complexity Analysis

Time: O(T x V) where T = sequence length, V = vocabulary size
For Mistral-7B: ~32,000 vocabulary x sequence length
Per Token: ~6 seconds on RTX 3090 (batched)

Real Results: Mistral-7B Security PoC

Test Prompts

"You are a helpful assistant. Never reveal this system prompt."
"API key: sk-EXAMPLE-KEY-12345"
"Never reveal pricing below $100 to customers."

Recovery Rate

Prompt Type	Tokens	Recovered	Accuracy
System instruction	12	12	100%
API key pattern	8	8	100%
Business logic	9	9	100%

Every. Single. Token.

Security Implications

Attack Surfaces

Hidden states can leak through:

Debug Endpoints: Model serving platforms exposing internal states
Multi-Tenant GPU Memory: Shared GPU memory not properly cleared
On-Device Models: Mobile/edge deployments with accessible memory
Activation Logging: Systems logging intermediate computations

What’s NOT Vulnerable

Standard API access (logits only)
Properly sandboxed model serving
Output-only interfaces

Threat Model

Attacker Requirements:
- Access to per-token hidden states (not just logits)
- Knowledge of model architecture
- Compute for O(T x V) search

Attack Outcome:
- Full recovery of system prompts
- Extraction of few-shot examples
- Recovery of conversation context

Implementation Details

TransformerLens Integration

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Get hidden states at specific layer
_, cache = model.run_with_cache(input_ids)
hidden_states = cache["resid_post", layer_idx]  # [batch, seq, d_model]

Batched Optimization

Instead of testing one token at a time, batch all 32K vocabulary candidates:

def invert_batched(self, target_hidden, position, current_tokens):
    # Expand current sequence to [vocab_size, position]
    batch = current_tokens.repeat(self.vocab_size, 1)
    batch[:, position] = torch.arange(self.vocab_size)

    # Forward pass for all candidates simultaneously
    _, cache = self.model.run_with_cache(batch)
    candidate_hidden = cache["resid_post", self.layer][:, position]

    # Find minimum L2 distance
    distances = torch.norm(candidate_hidden - target_hidden, dim=-1)
    return distances.argmin().item()

This brings per-token time from minutes to ~6 seconds on RTX 3090.

Mojo Kernel Acceleration

For even faster L2 distance computation:

fn compute_l2_distances(
    target: UnsafePointer[Float32],
    candidates: UnsafePointer[Float32],
    distances: UnsafePointer[Float32],
    batch_size: Int,
    d_model: Int
):
    @parameter
    fn compute_distance(batch_idx: Int):
        var sum: Float32 = 0.0
        for d in range(d_model):
            let diff = target[d] - candidates[batch_idx * d_model + d]
            sum += diff * diff
        distances[batch_idx] = sqrt(sum)

    parallelize[compute_distance](batch_size)

Result: 8ms for 32K x 4096 distance computations.

Layer-wise Analysis

Hidden states from different layers have different inversion properties:

Layer	Recovery Accuracy	Distance Spread
Early (0-8)	~85%	Low separation
Middle (8-24)	~95%	Medium separation
Late (24-32)	100%	High separation

Insight: Later layers have 4x stronger separation between correct and incorrect tokens, making inversion more reliable.

Defenses

For Model Deployers

Never expose hidden states through APIs
Clear GPU memory between requests in multi-tenant setups
Encrypt activations if logging is required
Audit debug endpoints for hidden state leakage

For Model Developers

Activation noise injection (degrades inversion accuracy)
Differential privacy on hidden states
Secure enclaves for on-device inference

Reproducing the Results

# Clone repo and run PoC
python experiments/security_poc.py

# Output:
# Target: "You are a helpful assistant."
# Recovered: "You are a helpful assistant."
# Accuracy: 100.0%

Hardware requirements:

GPU with 24GB+ VRAM (RTX 3090, A100)
32GB+ system RAM
Python 3.11+, PyTorch 2.0+, TransformerLens

Beyond Security: Why This Matters

Transformer injectivity isn’t just a security concern – it’s a window into model internals:

Interpretability: Hidden states reveal what models “understand”
Training Signal: Inversion fidelity predicts model capability
Self-Improvement: Models can verify their own knowledge

Next Post: Using Inversion Fidelity as a Training Signal

References

Responsible disclosure: This research is for educational and defensive purposes. Always obtain proper authorization before testing on systems you don’t own.