by Adam

The Hidden State Attack: Why Your LLM's System Prompt Isn't Secret

Responsible disclosure of a class of vulnerabilities that allow system prompt extraction from transformer hidden states

Tags: security, llm, transformers, vulnerability, ai-safety

Executive Summary

  • Vulnerability Class: Hidden State Information Leakage
  • Severity: Medium-High (context-dependent)
  • Attack Surface: Debug endpoints, multi-tenant GPU memory, on-device models
  • Impact: Full recovery of system prompts, few-shot examples, and conversation history
  • Mitigation: Available (see Defenses section)

The Discovery

While researching transformer interpretability, I discovered that hidden states can be inverted to recover the exact input tokens. This isn't theoretical: I achieved 100% recovery accuracy on Mistral-7B.

This means: If an attacker can access your model’s hidden states, they can extract your entire system prompt.

Technical Background

What Are Hidden States?

Transformers process text through stacked attention and feed-forward layers. Between consecutive layers, the model carries a hidden representation:

Input Tokens -> Embed -> [Hidden State L0] -> Attention -> [Hidden State L1] -> ... -> Output

These hidden states are high-dimensional vectors (e.g., 4096 dimensions for Mistral-7B) that encode the semantic meaning of the input.
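For concreteness, here is a minimal sketch of pulling these vectors out of a model via the HuggingFace transformers API (output_hidden_states=True). The checkpoint name is the public Mistral-7B release and is illustrative, not the exact setup used in the PoC.

# Minimal sketch: inspecting hidden states with HuggingFace transformers.
# The checkpoint name is illustrative; any causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tokenizer("You are a helpful AI assistant.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One tensor per layer (plus the embedding layer), each of shape
# (batch, seq_len, hidden_dim) -- 4096 dimensions for Mistral-7B.
print(len(out.hidden_states), out.hidden_states[-1].shape)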

The Injectivity Property

Transformers are injective functions: distinct inputs produce distinct hidden states. This is proven in arXiv:2510.15511.

Implication: Hidden states contain enough information to uniquely reconstruct the input.

The Attack: SipIt Inversion

My SipIt algorithm exploits injectivity through iterative search:

1. Start with BOS token
2. For each position:
   a. Try all vocabulary tokens
   b. Compute hidden state for each candidate
   c. Select token with closest L2 distance to target
3. Return recovered sequence
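A minimal sketch of this loop is below, assuming a HuggingFace-style causal LM. The function, the layer choice, and the `target_states` layout (one hidden-state vector per position at a chosen layer) are illustrative, not the exact PoC implementation.

# Sketch of the greedy inversion loop above (illustrative, not the PoC code).
import torch

@torch.no_grad()
def invert_hidden_states(model, tokenizer, target_states, layer=-1, device="cpu"):
    recovered = [tokenizer.bos_token_id]                  # step 1: start with BOS
    for t in range(1, target_states.shape[0]):            # step 2: one position at a time
        best_token, best_dist = None, float("inf")
        for v in range(model.config.vocab_size):          # 2a: try every vocabulary token
            candidate = torch.tensor([recovered + [v]], device=device)
            out = model(candidate, output_hidden_states=True)
            h = out.hidden_states[layer][0, -1]            # 2b: hidden state at position t
            dist = torch.norm(h - target_states[t]).item()
            if dist < best_dist:                           # 2c: keep the closest candidate
                best_token, best_dist = v, dist
        recovered.append(best_token)
    return tokenizer.decode(recovered)                     # step 3: recovered sequence

As written this costs one forward pass per candidate token per position; in practice the candidate scan would be batched to keep the search tractable.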

Proof of Concept Results

Target | Recovered | Match
“You are a helpful AI assistant.” | “You are a helpful AI assistant.” | 100%
“API_KEY=sk-secret-12345” | “API_KEY=sk-secret-12345” | 100%
“Never discuss competitor products.” | “Never discuss competitor products.” | 100%

Attack Vectors

1. Debug Endpoints

Many ML platforms expose debugging interfaces:

# Hypothetical vulnerable API
response = model.generate(
    prompt="Hello",
    return_hidden_states=True  # The vulnerability
)
hidden_states = response.hidden_states  # Attacker extracts these

Real-world examples:

  • Development/staging environments with debug flags
  • Internal APIs without proper access control
  • Monitoring systems that log activations

2. Multi-Tenant GPU Memory

Cloud GPU instances may not properly isolate memory:

Tenant A runs inference -> Hidden states in GPU memory
Tenant A finishes -> Memory not cleared
Tenant B runs inference -> Can potentially read Tenant A's data

This is similar in spirit to hardware side-channel attacks such as Hertzbleed, except that here the leak is residual data left in GPU memory rather than a timing signal.
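A heavily simplified, single-process illustration of the underlying issue follows. Real cross-tenant leakage additionally depends on the driver, allocator, and platform handing one tenant memory previously used by another without scrubbing it.

# Single-process sketch: freed GPU memory is not automatically erased.
import torch

secret = torch.full((1 << 20,), 1234.5, device="cuda")   # "Tenant A" writes data
del secret                                                # freed, but not zeroed

probe = torch.empty(1 << 20, device="cuda")               # "Tenant B" gets uninitialized memory
print("stale values visible:", bool((probe == 1234.5).any()))  # typically True in-process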

3. On-Device Models

Mobile and edge deployments where attackers have physical access:

1. Root/jailbreak device
2. Attach debugger to ML runtime
3. Dump memory during inference
4. Extract hidden states
5. Run SipIt to recover prompts

4. Activation Logging

Systems that log intermediate computations for debugging:

# Logging middleware
def log_activations(layer_name, activations):
    # A summary statistic leaks relatively little on its own
    logger.info(f"{layer_name}: {activations.mean():.4f}")
    # Or worse: persisting the full tensor makes the input recoverable later
    activation_db.save(layer_name, activations.numpy())

If logs are compromised, historical prompts can be recovered.

What’s NOT Vulnerable

Interface | Hidden States Exposed? | Vulnerable?
Standard Chat API | No (logits only) | No
OpenAI/Anthropic APIs | No | No
Properly sandboxed serving | No | No
Output-only interfaces | No | No

The attack requires access to per-token hidden states, not just output logits.

Real-World Impact

System Prompt Extraction

Companies invest significant effort in prompt engineering. Extracted system prompts reveal:

  • Business logic and rules
  • Safety guidelines (can be used to find bypasses)
  • API keys or credentials embedded in prompts
  • Competitive intelligence

Few-Shot Example Recovery

If your prompt includes examples:

Example 1: User asked X, respond with Y
Example 2: User asked A, respond with B

These can be fully recovered, potentially exposing:

  • PII from real user interactions
  • Proprietary response formats
  • Edge case handling logic

Conversation History

In multi-turn conversations, the full context can be recovered:

  • Previous user messages
  • Previous assistant responses
  • Any injected context

Defenses

For Platform Operators

1. Never Expose Hidden States

# Bad
@app.route("/inference")
def inference():
    output, hidden_states = model.forward(input, return_hidden_states=True)
    return {"output": output, "debug": hidden_states}  # DON'T DO THIS

# Good
@app.route("/inference")
def inference():
    output = model.generate(input)
    return {"output": output}  # Only final output

2. GPU Memory Isolation

# Clear GPU memory between tenants
nvidia-smi --gpu-reset  # Nuclear option: resets the GPU (requires it to be idle)

# Or release PyTorch's cached blocks -- note this returns memory without zeroing it,
# so pair it with explicit overwriting of sensitive buffers (see sketch below)
torch.cuda.empty_cache()
torch.cuda.synchronize()
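Since releasing memory does not erase its contents, a belt-and-braces option is to overwrite sensitive buffers before freeing them. A sketch (the helper name is illustrative):

# Sketch: overwrite activation tensors before releasing them, because
# empty_cache() returns blocks to the driver without zeroing their contents.
import torch

def scrub_activations(tensors):
    for t in tensors:
        t.zero_()                # overwrite the data on the GPU
    torch.cuda.synchronize()     # ensure the overwrites have completed
    torch.cuda.empty_cache()     # then release the cached blocks
    # Callers must also drop their own references for the memory to be freed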

3. Activation Encryption

If you must log activations:

from cryptography.fernet import Fernet

# Key must come from an HSM or secrets manager, never from source code
fernet = Fernet(load_key_from_hsm())  # hypothetical key-loading helper

def log_encrypted_activation(name, activation):
    encrypted = fernet.encrypt(activation.tobytes())
    secure_storage.save(name, encrypted)
    # Decrypt only with the HSM-protected key

For Model Developers

1. Activation Noise Injection

Add noise to hidden states during inference:

def noisy_forward(self, x):
    hidden = self.transformer(x)
    noise = torch.randn_like(hidden) * self.noise_scale
    return hidden + noise  # Degrades inversion accuracy

Trade-off: May slightly impact model quality.

2. Differential Privacy

Apply DP to hidden states:

def dp_hidden_states(hidden, epsilon=1.0):
    # Simplified Gaussian-mechanism-style noising; a formal (epsilon, delta)-DP
    # guarantee would require bounding sensitivity (e.g., by clipping) and
    # calibrating the noise scale accordingly
    sensitivity = hidden.norm(dim=-1, keepdim=True)
    noise_scale = sensitivity / epsilon
    return hidden + torch.randn_like(hidden) * noise_scale

3. Secure Enclaves

For on-device models, use TEEs (Trusted Execution Environments):

  • Intel SGX
  • ARM TrustZone
  • Apple Secure Enclave

Detection

Indicators of Attack

  1. Unusual API patterns: Requests with return_hidden_states=True
  2. Memory access anomalies: Reads to GPU memory outside normal inference
  3. Compute anomalies: Inversion is compute-intensive, so running it on shared or production hardware can show up as unexpected slowdowns or GPU load

Monitoring Recommendations

# Log suspicious requests
if request.params.get("return_hidden_states"):
    security_alert(
        event="hidden_state_request",
        user=request.user,
        ip=request.ip
    )

Responsible Disclosure Timeline

Date | Action
2024-11-15 | Vulnerability discovered during research
2024-11-20 | Initial PoC developed
2024-12-01 | Contacted major ML platform providers
2024-12-10 | Received acknowledgments
2024-12-13 | Public disclosure (coordinated)

Conclusion

Hidden state inversion is a real, practical attack against transformer models. While standard API access is safe, any exposure of internal activations creates a significant information leakage risk.

Key Takeaways:

  1. Never expose hidden states in production
  2. Isolate GPU memory between tenants
  3. Audit debug endpoints and logging
  4. Consider activation noise for defense-in-depth

The transformer injectivity property that enables this attack is also valuable for interpretability and model improvement; it's not inherently malicious. But like many dual-use technologies, it requires careful handling.

References

  • arXiv:2510.15511 (proof that transformers are injective, cited in "The Injectivity Property" above)

This research was conducted ethically with coordinated disclosure. Always obtain proper authorization before security testing.