Beyond Benchmark Gaming: Multi-Model Consensus for Genuinely Capable AI
Using agreement among frontier models (Claude, GPT-5, Gemini) as a training signal to build AI that's genuinely capable, not just benchmark-optimized
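The core idea can be sketched in a few lines: sample the same prompt across several frontier models, and reward a candidate answer by how many of them agree with it. This is a minimal illustration, not the post's actual pipeline; the exact-match agreement check and the stand-in answers are assumptions, and a real setup would call model APIs and use a semantic-equivalence judge rather than string comparison.

```python
def normalize(ans: str) -> str:
    """Crude normalization so trivially different phrasings compare equal."""
    return " ".join(ans.lower().split())

def consensus_reward(candidate: str, model_answers: list[str]) -> float:
    """Fraction of frontier-model answers that agree with the candidate.

    Agreement here is exact match after normalization; a production
    pipeline would use a semantic-equivalence check instead.
    """
    if not model_answers:
        return 0.0
    target = normalize(candidate)
    agree = sum(1 for a in model_answers if normalize(a) == target)
    return agree / len(model_answers)

# Hypothetical answers from three models to the same factual prompt.
answers = ["Paris", "paris", "Lyon"]
score = consensus_reward("Paris", answers)  # two of three models agree
```

The reward is deliberately a fraction rather than a binary vote, so partial agreement still carries signal during training.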
A follow-up to my retracted post on self-improving models. The fidelity metric improved 18% but actual performance dropped 11%. Here's what went wrong, what I learned about Goodhart's Law, and why reproducibility filtering might be the answer.
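Reproducibility filtering, as described above, can be sketched as: sample each prompt several times and keep only the prompts whose answers are self-consistent. Everything here is illustrative; the function name, the majority-vote criterion, and the threshold are my assumptions, with `sample` standing in for drawing one model completion.

```python
from collections import Counter
from typing import Callable

def reproducibility_filter(prompts: list[str],
                           sample: Callable[[str], str],
                           n_samples: int = 5,
                           threshold: float = 0.8) -> list[str]:
    """Keep prompts whose sampled answers are self-consistent.

    A prompt passes when its most common answer covers at least
    `threshold` of the `n_samples` draws; noisy prompts are dropped
    before they can feed an unreliable training signal.
    """
    kept = []
    for prompt in prompts:
        answers = [sample(prompt) for _ in range(n_samples)]
        top_count = Counter(answers).most_common(1)[0][1]
        if top_count / n_samples >= threshold:
            kept.append(prompt)
    return kept
```

Filtering on self-consistency targets the Goodhart failure directly: a metric that only moves on irreproducible outputs stops moving once those outputs are excluded from training.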