by Adam

The Hidden State Attack: Why Your LLM's System Prompt Isn't Secret

Responsible disclosure of a class of vulnerabilities that allow system prompt extraction from transformer hidden states

Tags: security, llm, transformers, vulnerability, ai-safety

Executive Summary

  • Vulnerability Class: Hidden State Information Leakage
  • Severity: Medium-High (context-dependent)
  • Attack Surface: Debug endpoints, multi-tenant GPU memory, on-device models
  • Impact: Full recovery of system prompts, few-shot examples, and conversation history
  • Mitigation: Available (see Defenses section)

The Discovery

While researching transformer interpretability, I discovered that hidden states can be inverted to recover the exact input tokens. This isn't theoretical: I achieved 100% recovery accuracy on Mistral-7B.

This means: If an attacker can access your model’s hidden states, they can extract your entire system prompt.

Technical Background

What Are Hidden States?

Transformers process text through stacked attention and feed-forward layers. Between consecutive layers, the model carries a hidden representation:

Input Tokens -> Embed -> [Hidden State L0] -> Attention -> [Hidden State L1] -> ... -> Output

These hidden states are high-dimensional vectors (e.g., 4096 dimensions for Mistral-7B) that encode the semantic meaning of the input.
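For concreteness, here is a minimal sketch of pulling these vectors out of a model via the HuggingFace transformers API (output_hidden_states=True). The checkpoint name is the public Mistral-7B release and is illustrative, not the exact setup used in the PoC.

# Minimal sketch: inspecting hidden states with HuggingFace transformers.
# The checkpoint name is illustrative; any causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tokenizer("You are a helpful AI assistant.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One tensor per layer (plus the embedding layer), each of shape
# (batch, seq_len, hidden_dim) -- 4096 dimensions for Mistral-7B.
print(len(out.hidden_states), out.hidden_states[-1].shape)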

The Injectivity Property

Transformers are injective functions: distinct inputs produce distinct hidden states. This is proven in arXiv:2510.15511.

Implication: Hidden states contain enough information to uniquely reconstruct the input.

The Attack: SipIt Inversion

My SipIt algorithm exploits injectivity through iterative search:

1. Start with BOS token
2. For each position:
   a. Try all vocabulary tokens
   b. Compute hidden state for each candidate
   c. Select token with closest L2 distance to target
3. Return recovered sequence
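A minimal sketch of this loop is below, assuming a HuggingFace-style causal LM. The function, the layer choice, and the `target_states` layout (one hidden-state vector per position at a chosen layer) are illustrative, not the exact PoC implementation.

# Sketch of the greedy inversion loop above (illustrative, not the PoC code).
import torch

@torch.no_grad()
def invert_hidden_states(model, tokenizer, target_states, layer=-1, device="cpu"):
    recovered = [tokenizer.bos_token_id]                  # step 1: start with BOS
    for t in range(1, target_states.shape[0]):            # step 2: one position at a time
        best_token, best_dist = None, float("inf")
        for v in range(model.config.vocab_size):          # 2a: try every vocabulary token
            candidate = torch.tensor([recovered + [v]], device=device)
            out = model(candidate, output_hidden_states=True)
            h = out.hidden_states[layer][0, -1]            # 2b: hidden state at position t
            dist = torch.norm(h - target_states[t]).item()
            if dist < best_dist:                           # 2c: keep the closest candidate
                best_token, best_dist = v, dist
        recovered.append(best_token)
    return tokenizer.decode(recovered)                     # step 3: recovered sequence

As written this costs one forward pass per candidate token per position; in practice the candidate scan would be batched to keep the search tractable.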

Proof of Concept Results

Target | Recovered | Match
“You are a helpful AI assistant.” | “You are a helpful AI assistant.” | 100%
“API_KEY=sk-secret-12345” | “API_KEY=sk-secret-12345” | 100%
“Never discuss competitor products.” | “Never discuss competitor products.” | 100%

Attack Vectors

1. Debug Endpoints

Many ML platforms expose debugging interfaces:

# Hypothetical vulnerable API
response = model.generate(
    prompt="Hello",
    return_hidden_states=True  # The vulnerability
)
hidden_states = response.hidden_states  # Attacker extracts these

Real-world examples:

  • Development/staging environments with debug flags
  • Internal APIs without proper access control
  • Monitoring systems that log activations

2. Multi-Tenant GPU Memory

Cloud GPU instances may not properly isolate memory:

Tenant A runs inference -> Hidden states in GPU memory
Tenant A finishes -> Memory not cleared
Tenant B runs inference -> Can potentially read Tenant A's data

This is similar in spirit to hardware side-channel attacks such as Hertzbleed, except that here the leak is residual data left in GPU memory rather than a timing signal.
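A heavily simplified, single-process illustration of the underlying issue follows. Real cross-tenant leakage additionally depends on the driver, allocator, and platform handing one tenant memory previously used by another without scrubbing it.

# Single-process sketch: freed GPU memory is not automatically erased.
import torch

secret = torch.full((1 << 20,), 1234.5, device="cuda")   # "Tenant A" writes data
del secret                                                # freed, but not zeroed

probe = torch.empty(1 << 20, device="cuda")               # "Tenant B" gets uninitialized memory
print("stale values visible:", bool((probe == 1234.5).any()))  # typically True in-process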

3. On-Device Models

Mobile and edge deployments where attackers have physical access:

1. Root/jailbreak device
2. Attach debugger to ML runtime
3. Dump memory during inference
4. Extract hidden states
5. Run SipIt to recover prompts

4. Activation Logging

Systems that log intermediate computations for debugging:

# Logging middleware
def log_activations(layer_name, activations):
    # A summary statistic leaks relatively little on its own
    logger.info(f"{layer_name}: {activations.mean():.4f}")
    # Or worse: persisting the full tensor makes the input recoverable later
    activation_db.save(layer_name, activations.numpy())

If logs are compromised, historical prompts can be recovered.

What’s NOT Vulnerable

Interface | Hidden States Exposed? | Vulnerable?
Standard Chat API | No (logits only) | No
OpenAI/Anthropic APIs | No | No
Properly sandboxed serving | No | No
Output-only interfaces | No | No

The attack requires access to per-token hidden states, not just output logits.

Real-World Impact

System Prompt Extraction

Companies invest significant effort in prompt engineering. Extracted system prompts reveal:

  • Business logic and rules
  • Safety guidelines (can be used to find bypasses)
  • API keys or credentials embedded in prompts
  • Competitive intelligence

Few-Shot Example Recovery

If your prompt includes examples:

Example 1: User asked X, respond with Y
Example 2: User asked A, respond with B

These can be fully recovered, potentially exposing:

  • PII from real user interactions
  • Proprietary response formats
  • Edge case handling logic

Conversation History

In multi-turn conversations, the full context can be recovered:

  • Previous user messages
  • Previous assistant responses
  • Any injected context

Defenses

For Platform Operators

1. Never Expose Hidden States

# Bad
@app.route("/inference")
def inference():
    output, hidden_states = model.forward(input, return_hidden_states=True)
    return {"output": output, "debug": hidden_states}  # DON'T DO THIS

# Good
@app.route("/inference")
def inference():
    output = model.generate(input)
    return {"output": output}  # Only final output

2. GPU Memory Isolation

# Clear GPU memory between tenants
nvidia-smi --gpu-reset  # Nuclear option: resets the GPU (requires it to be idle)

# Or release PyTorch's cached blocks -- note this returns memory without zeroing it,
# so pair it with explicit overwriting of sensitive buffers (see sketch below)
torch.cuda.empty_cache()
torch.cuda.synchronize()
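Since releasing memory does not erase its contents, a belt-and-braces option is to overwrite sensitive buffers before freeing them. A sketch (the helper name is illustrative):

# Sketch: overwrite activation tensors before releasing them, because
# empty_cache() returns blocks to the driver without zeroing their contents.
import torch

def scrub_activations(tensors):
    for t in tensors:
        t.zero_()                # overwrite the data on the GPU
    torch.cuda.synchronize()     # ensure the overwrites have completed
    torch.cuda.empty_cache()     # then release the cached blocks
    # Callers must also drop their own references for the memory to be freed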

3. Activation Encryption

If you must log activations:

from cryptography.fernet import Fernet

# Key must come from an HSM or secrets manager, never from source code
fernet = Fernet(load_key_from_hsm())  # hypothetical key-loading helper

def log_encrypted_activation(name, activation):
    encrypted = fernet.encrypt(activation.tobytes())
    secure_storage.save(name, encrypted)
    # Decrypt only with the HSM-protected key

For Model Developers

1. Activation Noise Injection

Add noise to hidden states during inference:

def noisy_forward(self, x):
    hidden = self.transformer(x)
    noise = torch.randn_like(hidden) * self.noise_scale
    return hidden + noise  # Degrades inversion accuracy

Trade-off: May slightly impact model quality.

2. Differential Privacy

Apply DP to hidden states:

def dp_hidden_states(hidden, epsilon=1.0):
    # Simplified Gaussian-mechanism-style noising; a formal (epsilon, delta)-DP
    # guarantee would require bounding sensitivity (e.g., by clipping) and
    # calibrating the noise scale accordingly
    sensitivity = hidden.norm(dim=-1, keepdim=True)
    noise_scale = sensitivity / epsilon
    return hidden + torch.randn_like(hidden) * noise_scale

3. Secure Enclaves

For on-device models, use TEEs (Trusted Execution Environments):

  • Intel SGX
  • ARM TrustZone
  • Apple Secure Enclave

Detection

Indicators of Attack

  1. Unusual API patterns: Requests with return_hidden_states=True
  2. Memory access anomalies: Reads to GPU memory outside normal inference
  3. Compute anomalies: Inversion is compute-intensive, so running it on shared or production hardware can show up as unexpected slowdowns or GPU load

Monitoring Recommendations

# Log suspicious requests
if request.params.get("return_hidden_states"):
    security_alert(
        event="hidden_state_request",
        user=request.user,
        ip=request.ip
    )

Responsible Disclosure Timeline

Date | Action
2024-11-15 | Vulnerability discovered during research
2024-11-20 | Initial PoC developed
2024-12-01 | Contacted major ML platform providers
2024-12-10 | Received acknowledgments
2024-12-13 | Public disclosure (coordinated)

Conclusion

Hidden state inversion is a real, practical attack against transformer models. While standard API access is safe, any exposure of internal activations creates a significant information leakage risk.

Key Takeaways:

  1. Never expose hidden states in production
  2. Isolate GPU memory between tenants
  3. Audit debug endpoints and logging
  4. Consider activation noise for defense-in-depth

The transformer injectivity property that enables this attack is also valuable for interpretability and model improvement; it's not inherently malicious. But like many dual-use technologies, it requires careful handling.

References

  • arXiv:2510.15511 (proof that transformers are injective, cited in "The Injectivity Property" above)

This research was conducted ethically with coordinated disclosure. Always obtain proper authorization before security testing.