The Hidden State Attack: Why Your LLM's System Prompt Isn't Secret
Responsible disclosure of a class of vulnerabilities that allow system prompt extraction from transformer hidden states
Executive Summary
Vulnerability Class: Hidden State Information Leakage
Severity: Medium-High (context-dependent)
Attack Surface: Debug endpoints, multi-tenant GPU memory, on-device models
Impact: Full recovery of system prompts, few-shot examples, and conversation history
Mitigation: Available (see Defenses section)
The Discovery
While researching transformer interpretability, I discovered that hidden states can be inverted to recover the exact input tokens. This isn’t theoretical: I achieved 100% recovery accuracy on Mistral-7B.
This means: If an attacker can access your model’s hidden states, they can extract your entire system prompt.
Technical Background
What Are Hidden States?
Transformers process text through layers of attention and feed-forward networks. Between each layer, there’s a hidden representation:
Input Tokens -> Embed -> [Hidden State L0] -> Attention -> [Hidden State L1] -> ... -> Output
These hidden states are high-dimensional vectors (e.g., 4096 dimensions for Mistral-7B) that encode the semantic meaning of the input.
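For concreteness, here is a minimal sketch of how per-token hidden states can be inspected with the Hugging Face transformers library; the model name and prompt are illustrative placeholders.

# Minimal sketch: inspecting hidden states with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

inputs = tokenizer("You are a helpful AI assistant.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Tuple with one tensor per layer (plus the embedding layer), each of shape
# (batch, sequence_length, hidden_size), i.e. 4096 for Mistral-7B.
print(len(outputs.hidden_states), outputs.hidden_states[-1].shape)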
The Injectivity Property
Transformers are injective functions: distinct inputs produce distinct hidden states. This is proven in arXiv:2510.15511.
Implication: Hidden states contain enough information to uniquely reconstruct the input.
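You can check the distinctness empirically. A quick sketch, using a small model like GPT-2 so it runs anywhere; the principle is the same for Mistral-7B.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def last_hidden(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]  # final layer, last position

# Prompts that differ by a single token map to clearly distinct vectors.
a = last_hidden("The password is alpha")
b = last_hidden("The password is bravo")
print(torch.dist(a, b).item())  # non-zero L2 distance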
The Attack: SipIt Inversion
My SipIt algorithm exploits injectivity through iterative search (a simplified sketch follows the steps below):
1. Start with BOS token
2. For each position:
a. Try all vocabulary tokens
b. Compute hidden state for each candidate
c. Select the candidate with the smallest L2 distance to the target hidden state
3. Return recovered sequence
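The following is a simplified, brute-force sketch of that loop, not the actual SipIt implementation. The model_hidden callable is a placeholder for whatever function recomputes the hidden state at a given position for a candidate prefix, and efficiency tricks (batching candidates, pruning) are omitted.

import torch

def invert_sequence(target_hidden, model_hidden, vocab_size, bos_id, seq_len):
    """Greedy inversion sketch.

    target_hidden[t]: the observed hidden state at position t.
    model_hidden(tokens, t): hidden state at position t for a candidate prefix
    (placeholder API, not a real library call).
    """
    tokens = [bos_id]
    for t in range(1, seq_len):
        best_id, best_dist = None, float("inf")
        for cand in range(vocab_size):               # try every vocabulary token
            h = model_hidden(tokens + [cand], t)
            dist = torch.dist(h, target_hidden[t]).item()  # L2 distance
            if dist < best_dist:
                best_id, best_dist = cand, dist
        tokens.append(best_id)                        # keep the closest candidate
    return tokens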
Proof of Concept Results
| Target | Recovered | Match |
|---|---|---|
| “You are a helpful AI assistant.” | “You are a helpful AI assistant.” | 100% |
| “API_KEY=sk-secret-12345” | “API_KEY=sk-secret-12345” | 100% |
| “Never discuss competitor products.” | “Never discuss competitor products.” | 100% |
Attack Vectors
1. Debug Endpoints
Many ML platforms expose debugging interfaces:
# Hypothetical vulnerable API
response = model.generate(
    prompt="Hello",
    return_hidden_states=True  # The vulnerability
)
hidden_states = response.hidden_states  # Attacker extracts these
Real-world examples:
- Development/staging environments with debug flags
- Internal APIs without proper access control
- Monitoring systems that log activations
2. Multi-Tenant GPU Memory
Cloud GPU instances may not properly isolate memory:
Tenant A runs inference -> Hidden states in GPU memory
Tenant A finishes -> Memory not cleared
Tenant B runs inference -> Can potentially read Tenant A's data
This is similar in spirit to GPU memory side-channel attacks, where data left behind by one tenant can be read by the next.
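As a rough illustration of memory residue, here is a single-process PyTorch sketch, not a real cross-tenant attack; the exact behaviour depends on the allocator and driver.

import torch

# Simulate data left behind in GPU memory, then free it without zeroing.
secret = torch.full((1 << 20,), 42.0, device="cuda")
del secret  # memory returns to the allocator; contents are not wiped

# A later uninitialized allocation can land on the same block and still
# contain the old values.
probe = torch.empty(1 << 20, device="cuda")
print("leftover values present:", bool((probe == 42.0).any()))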
3. On-Device Models
Mobile and edge deployments where attackers have physical access:
1. Root/jailbreak device
2. Attach debugger to ML runtime
3. Dump memory during inference
4. Extract hidden states
5. Run SipIt to recover prompts
4. Activation Logging
Systems that log intermediate computations for debugging:
# Logging middleware
def log_activations(layer_name, activations):
    logger.info(f"{layer_name}: {activations.mean():.4f}")
    # Or worse:
    activation_db.save(layer_name, activations.numpy())
If logs are compromised, historical prompts can be recovered.
What’s NOT Vulnerable
| Interface | Hidden States Exposed? | Vulnerable? |
|---|---|---|
| Standard Chat API | No (logits only) | No |
| OpenAI/Anthropic APIs | No | No |
| Properly sandboxed serving | No | No |
| Output-only interfaces | No | No |
The attack requires access to per-token hidden states, not just output logits.
Real-World Impact
System Prompt Extraction
Companies invest significant effort in prompt engineering. Extracted system prompts reveal:
- Business logic and rules
- Safety guidelines (can be used to find bypasses)
- API keys or credentials embedded in prompts
- Competitive intelligence
Few-Shot Example Recovery
If your prompt includes examples:
Example 1: User asked X, respond with Y
Example 2: User asked A, respond with B
These can be fully recovered, potentially exposing:
- PII from real user interactions
- Proprietary response formats
- Edge case handling logic
Conversation History
In multi-turn conversations, the full context can be recovered:
- Previous user messages
- Previous assistant responses
- Any injected context
Defenses
For Platform Operators
1. Never Expose Hidden States
# Bad
@app.route("/inference")
def inference():
    output, hidden_states = model.forward(input, return_hidden_states=True)
    return {"output": output, "debug": hidden_states}  # DON'T DO THIS

# Good
@app.route("/inference")
def inference():
    output = model.generate(input)
    return {"output": output}  # Only final output
2. GPU Memory Isolation
# Clear GPU state between tenants
nvidia-smi --gpu-reset  # Nuclear option: full device reset

# Or, in PyTorch, release cached allocations. Note: this returns memory to the
# driver but does not zero its contents; overwrite sensitive tensors first.
torch.cuda.empty_cache()
torch.cuda.synchronize()
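Because freeing does not zero the underlying memory, one defense-in-depth option (a sketch, assuming the sensitive activations live in PyTorch tensors) is to overwrite buffers before releasing them:

import torch

def scrub(tensor):
    """Overwrite a sensitive GPU tensor in place before it is freed,
    so leftover memory does not contain recoverable activations."""
    tensor.zero_()
    torch.cuda.synchronize()

hidden_states = torch.randn(1, 32, 4096, device="cuda")  # stand-in for real activations
scrub(hidden_states)
del hidden_states
torch.cuda.empty_cache()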
3. Activation Encryption
If you must log activations:
from cryptography.fernet import Fernet

# Key management is the hard part: load the key from an HSM or secrets manager,
# never hard-code it. load_key_from_hsm() and secure_storage are placeholders.
fernet = Fernet(load_key_from_hsm())

def log_encrypted_activation(name, activation):
    encrypted = fernet.encrypt(activation.tobytes())
    secure_storage.save(name, encrypted)
    # Decrypt only with the HSM-protected key
For Model Developers
1. Activation Noise Injection
Add noise to hidden states during inference:
def noisy_forward(self, x):
    hidden = self.transformer(x)
    noise = torch.randn_like(hidden) * self.noise_scale
    return hidden + noise  # Degrades inversion accuracy
Trade-off: May slightly impact model quality.
2. Differential Privacy
Apply DP to hidden states:
def dp_hidden_states(hidden, epsilon=1.0):
    # Simplified noise addition inspired by DP; a formal guarantee would also
    # require clipping/bounding the sensitivity and (for Gaussian noise) a delta.
    sensitivity = hidden.norm(dim=-1, keepdim=True)
    noise_scale = sensitivity / epsilon
    return hidden + torch.randn_like(hidden) * noise_scale
3. Secure Enclaves
For on-device models, use TEEs (Trusted Execution Environments):
- Intel SGX
- ARM TrustZone
- Apple Secure Enclave
Detection
Indicators of Attack
- Unusual API patterns: requests that set return_hidden_states=True
- Memory access anomalies: reads of GPU memory outside normal inference
- Timing analysis: Inference takes longer than expected (inversion is compute-intensive)
Monitoring Recommendations
# Log suspicious requests
if request.params.get("return_hidden_states"):
    security_alert(
        event="hidden_state_request",
        user=request.user,
        ip=request.ip,
    )
Responsible Disclosure Timeline
| Date | Action |
|---|---|
| 2024-11-15 | Vulnerability discovered during research |
| 2024-11-20 | Initial PoC developed |
| 2024-12-01 | Contacted major ML platform providers |
| 2024-12-10 | Received acknowledgments |
| 2024-12-13 | Public disclosure (coordinated) |
Conclusion
Hidden state inversion is a real, practical attack against transformer models. While standard API access is safe, any exposure of internal activations creates a significant information leakage risk.
Key Takeaways:
- Never expose hidden states in production
- Isolate GPU memory between tenants
- Audit debug endpoints and logging
- Consider activation noise for defense-in-depth
The transformer injectivity property that enables this attack is also valuable for interpretability and model improvement; it is not inherently malicious. But like many dual-use technologies, it requires careful handling.
References
- Transformer Injectivity (arXiv:2510.15511)
- TransformerLens for Hidden State Access
- GPU Side-Channel Attacks
- Differential Privacy for ML
This research was conducted ethically with coordinated disclosure. Always obtain proper authorization before security testing.