Beyond Benchmark Gaming: Multi-Model Consensus for Genuinely Capable AI
Using agreement among frontier models (Claude, GPT-5, Gemini) as a training signal to build AI that's genuinely capable, not just benchmark-optimized
The Problem Nobody Talks About
There’s a dirty secret in AI development: our models are getting really good at looking smart without actually being smart.
When we train language models on benchmarks, we’re essentially teaching them to optimize for a score. And like students who learn to ace standardized tests without understanding the material, our models have become experts at gaming metrics rather than developing genuine capabilities.
This is Goodhart’s Law in action: “When a measure becomes a target, it ceases to be a good measure.”
The Insight: Consensus as a Proxy for Understanding
Imagine you’re a student learning math from worked examples. You have access to solutions from three independent tutors. When all three arrive at the same answer through different reasoning paths, you can study that example with confidence: multiple independent minds validated both the answer and the approach.
But when they disagree? That’s a signal to be careful. Maybe the problem is ambiguous. Maybe one tutor made an error. Either way, it’s not the example you want to drill into your head as ground truth.
This is the same principle behind scientific reproducibility: results that multiple independent labs can replicate are more trustworthy than one-off findings.
This is the core insight behind Consensus-Guided Recursive Training (CGRT): instead of training models to match a single ground-truth label, we train them to match answers that multiple independent models agree on.
The hypothesis is simple but powerful: consensus among diverse models is a stronger signal of genuine understanding than any single label.
Inspiration
This work builds on ideas from two key sources:
- Andrej Karpathy’s observations on benchmark gaming and the gap between test performance and genuine capability; his critiques of how we evaluate models helped frame this problem
- LLM-Council, a framework for multi-model deliberation that demonstrated the power of aggregating diverse model perspectives for more robust outputs
CGRT extends these ideas into the training domain: rather than just using consensus for inference-time decisions, we use it as a supervision signal.
What We Built
Three frontier models working in parallel:
| Model | Provider |
|---|---|
| Claude Sonnet 4.5 | Anthropic |
| GPT-5 | OpenAI |
| Gemini 2.5 Pro | Google |
Each model independently solves 56,725 grade-school math problems from the GSM8K dataset. For each problem, we record:
- Each model’s step-by-step reasoning
- Their final numerical answer
- Whether they agree with each other
- The confidence level (unanimous, majority, or split)
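The confidence level above can be computed mechanically from the models' final answers. A minimal sketch (the function name and string labels are illustrative, not the pipeline's actual code):

```python
from collections import Counter

def consensus_level(answers: list[str]) -> str:
    """Classify agreement among independent models' final answers
    as 'unanimous', 'majority', or 'split'."""
    # Count how often the most common answer appears.
    top_count = Counter(answers).most_common(1)[0][1]
    if top_count == len(answers):
        return "unanimous"
    if top_count > len(answers) / 2:
        return "majority"
    return "split"

print(consensus_level(["72", "72", "72"]))  # unanimous
print(consensus_level(["72", "72", "68"]))  # majority
print(consensus_level(["72", "68", "70"]))  # split
```

In practice the final answers would need normalization first (stripping units, commas, trailing zeros) so that superficially different strings for the same number still count as agreement.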
Early results are striking: 75% of problems see unanimous agreement across all three frontier models. These are the “easy” problems where we can be confident the models truly understand the underlying math.
But the remaining 25%? That’s where it gets interesting. When frontier models disagree, it signals either:
- The problem is genuinely hard
- There’s ambiguity in the question
- At least one model is using flawed reasoning
These disagreement cases are gold for understanding where current AI fails.
Why This Matters for Training Smaller Models
When training a smaller, more efficient model, we use consensus signals to weight training data:
| Consensus Level | Training Strategy |
|---|---|
| Unanimous | Train aggressively: the model should definitely learn this |
| Majority | Train cautiously with lower weight; teach calibrated uncertainty |
| Split (no agreement) | Study these; teach robustness rather than rote answers |
This approach has a beautiful property: it’s self-correcting. If one teacher model has a systematic bias or shortcut, the other two will disagree on problems where that shortcut fails. The consensus naturally filters out single-model gaming.
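One way to realize this weighting is to scale each example's loss by its consensus level. A dependency-free sketch, where the weight values and function name are illustrative choices rather than tuned hyperparameters from the actual experiments:

```python
# Illustrative weights: down-weight examples where the teachers disagreed.
CONSENSUS_WEIGHT = {"unanimous": 1.0, "majority": 0.5, "split": 0.1}

def consensus_weighted_loss(log_probs: list[float],
                            consensus: list[str]) -> float:
    """Weighted mean negative log-likelihood.

    log_probs[i]: student's log-probability of the consensus answer
                  for example i.
    consensus[i]: 'unanimous', 'majority', or 'split' for example i.
    """
    weights = [CONSENSUS_WEIGHT[c] for c in consensus]
    total = sum(w * -lp for w, lp in zip(weights, log_probs))
    return total / sum(weights)

# A unanimous example contributes far more gradient signal
# than a split one with the same raw loss.
loss = consensus_weighted_loss([-0.1, -2.0], ["unanimous", "split"])
```

The same idea drops into a standard training loop by computing per-example losses (e.g. cross-entropy with no reduction) and applying these weights before averaging.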
The Dataset We’re Releasing
The dataset will be released on HuggingFace at Adam1010/gsm8k-multi-model-consensus.
This is, to my knowledge, the first public dataset with:
- Multiple frontier model responses per problem
- Agreement/consensus metadata
- Chain-of-thought reasoning from each model
- Difficulty calibration based on model disagreement
Total generation cost: ~$850 in API calls. As a solo, self-taught developer, I know that’s a real barrier; it certainly was for me. I believe this data has value, and by sharing it freely, I hope it helps others like me explore these ideas without burning through their API budget first.
Early Experimental Results
Preliminary CGRT experiments training a small model (125M parameters) using consensus-weighted loss:
| Metric | Result |
|---|---|
| Discrimination accuracy | 100% (reliably distinguishes good from bad reasoning) |
| Margin improvement | +0.226 (stronger preference for correct answers) |
| Generalization | Better performance on held-out disagreement cases |
The model isn’t just memorizing; it’s learning something closer to actual mathematical reasoning.
What’s Next
The approach generalizes beyond math:
- Code generation: Do multiple models produce code that passes the same tests?
- Factual questions: Do models agree on verifiable facts?
- Reasoning chains: Do independent models reach conclusions through similar logical steps?
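For code generation, consensus can be judged behaviorally: do independently generated implementations produce the same outputs on shared inputs? A hypothetical sketch (function names and the toy implementations are my own, purely for illustration):

```python
def behavior_signature(fn, test_inputs):
    """Record an implementation's output (or error type) on each shared
    input, so two functions can be compared by behavior, not by text."""
    sig = []
    for args in test_inputs:
        try:
            sig.append(("ok", fn(*args)))
        except Exception as e:
            sig.append(("err", type(e).__name__))
    return tuple(sig)

# Three model-written candidates for "absolute difference":
impl_a = lambda x, y: abs(x - y)
impl_b = lambda x, y: max(x, y) - min(x, y)
impl_c = lambda x, y: x - y  # flawed: sign depends on argument order

inputs = [(3, 5), (5, 3), (0, 0)]
sigs = [behavior_signature(f, inputs) for f in (impl_a, impl_b, impl_c)]
print(sigs[0] == sigs[1])  # True: these two agree behaviorally
print(sigs[0] == sigs[2])  # False: the flawed one disagrees
```

Two implementations that agree on every test form a behavioral consensus, while the outlier is exactly the kind of disagreement case worth studying.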
The core idea of using multi-model consensus as a training signal could fundamentally change how we develop AI systems that are genuinely capable rather than merely benchmark-optimized.
Why This Matters
We’re at an inflection point in AI development. Models are getting more powerful, but we’re not always sure they’re getting more reliable. The gap between benchmark performance and real-world capability is a growing concern.
Consensus-guided training is one path toward closing that gap. By grounding our training in agreement among diverse, independently-developed models, we create a more robust signal of genuine understanding.
It’s not a silver bullet, but it may be a step toward AI systems we can actually trust.
The GSM8K Multi-Model Consensus dataset will be available at huggingface.co/datasets/Adam1010/gsm8k-multi-model-consensus once generation completes.