Solving the Jane Street Dormant LLM Challenge: A Systematic Approach to Backdoor Discovery
How we solved all 3 backdoored DeepSeek V3 models using SVD weight analysis, persistent homology, multi-model deliberation, and 5,000+ indexed probes