by Adam

Beyond Benchmark Gaming: Multi-Model Consensus for Genuinely Capable AI

Using agreement among frontier models (Claude, GPT-5, Gemini) as a training signal to build AI that's genuinely capable, not just benchmark-optimized

ai-training consensus goodhart llm multi-model novel-research

The Problem Nobody Talks About

There’s a dirty secret in AI development: our models are getting really good at looking smart without actually being smart.

When we train language models on benchmarks, we’re essentially teaching them to optimize for a score. And like students who learn to ace standardized tests without understanding the material, our models have become experts at gaming metrics rather than developing genuine capabilities.

This is Goodhart’s Law in action: “When a measure becomes a target, it ceases to be a good measure.”

The Insight: Consensus as a Proxy for Understanding

Imagine you’re a student learning math from worked examples. You have access to solutions from three independent tutors. When all three arrive at the same answer through different reasoning paths, you can study that example with confidence–multiple independent minds validated both the answer and the approach.

But when they disagree? That’s a signal to be careful. Maybe the problem is ambiguous. Maybe one tutor made an error. Either way, it’s not the example you want to drill into your head as ground truth.

This is the same principle behind scientific reproducibility: results that multiple independent labs can replicate are more trustworthy than one-off findings.

This is the core insight behind Consensus-Guided Recursive Training (CGRT): instead of training models to match a single ground-truth label, we train them to match answers that multiple independent models agree on.

The hypothesis is simple but powerful: consensus among diverse models is a stronger signal of genuine understanding than any single label.

Inspiration

This work builds on ideas from two key sources:

  • Andrej Karpathy’s observations on benchmark gaming and the gap between test performance and genuine capability–his critiques of how we evaluate models helped frame this problem
  • LLM-Council–a framework for multi-model deliberation that demonstrated the power of aggregating diverse model perspectives for more robust outputs

CGRT extends these ideas into the training domain: rather than just using consensus for inference-time decisions, we use it as a supervision signal.

What We Built

Three frontier models working in parallel (the fan-out loop is sketched in code after the table):

| Model | Provider |
| --- | --- |
| Claude Sonnet 4.5 | Anthropic |
| GPT-5 | OpenAI |
| Gemini 2.5 Pro | Google |
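The collection pipeline queries all three providers concurrently for each problem. Here's a minimal sketch of that fan-out; `ask_claude`, `ask_gpt5`, and `ask_gemini` are hypothetical wrappers around each provider's SDK, each taking a problem string and returning a (chain-of-thought, final-answer) pair:

```python
from concurrent.futures import ThreadPoolExecutor

def solve_in_parallel(problem: str, solvers: dict) -> dict:
    """Send one problem to every teacher model at the same time.

    solvers maps a model name to a callable that takes the problem text
    and returns (chain_of_thought, final_answer).
    """
    with ThreadPoolExecutor(max_workers=len(solvers)) as pool:
        futures = {name: pool.submit(fn, problem) for name, fn in solvers.items()}
        return {name: future.result() for name, future in futures.items()}

# Usage (ask_* are hypothetical SDK wrappers, not real library calls):
# responses = solve_in_parallel(problem, {
#     "claude": ask_claude, "gpt5": ask_gpt5, "gemini": ask_gemini,
# })
```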

Each model independently solves 56,725 grade-school math problems from the GSM8K dataset. For each problem, we record the following (a code sketch of the resulting record follows the list):

  • Each model’s step-by-step reasoning
  • Their final numerical answer
  • Whether they agree with each other
  • The confidence level (unanimous, majority, or split)
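Concretely, each record and the consensus classification might look like this. A minimal sketch: the `ConsensusRecord` fields and the `classify_consensus` helper are names I made up for illustration, not the dataset's actual schema:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ConsensusRecord:
    """One GSM8K problem plus all three teacher responses.

    Field names are hypothetical; check the released dataset card
    for the actual schema.
    """
    question: str
    reasoning: dict[str, str]    # model name -> chain-of-thought text
    answers: dict[str, float]    # model name -> final numerical answer
    consensus: str               # "unanimous", "majority", or "split"

def classify_consensus(answers: dict[str, float]) -> str:
    """Map the three final answers onto a confidence level."""
    top_count = Counter(answers.values()).most_common(1)[0][1]
    if top_count == len(answers):
        return "unanimous"       # all three agree
    if top_count >= 2:
        return "majority"        # two of three agree
    return "split"               # no agreement at all
```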

Early results are striking: 75% of problems see unanimous agreement across all three frontier models. These are the “easy” problems where we can be confident the models truly understand the underlying math.

But the remaining 25%? That’s where it gets interesting. When frontier models disagree, it signals either:

  1. The problem is genuinely hard
  2. There’s ambiguity in the question
  3. At least one model is using flawed reasoning

These disagreement cases are gold for understanding where current AI fails.

Why This Matters for Training Smaller Models

When training a smaller, more efficient model, we use consensus signals to weight the training data (a loss-weighting sketch appears below the table):

| Consensus Level | Training Strategy |
| --- | --- |
| High (unanimous) | Train aggressively–model should definitely learn this |
| Low (majority) | Train cautiously with lower weight, teach uncertainty |
| Zero (no agreement) | Study these–teach robustness rather than rote answers |

This approach has a beautiful property: it’s self-correcting. If one teacher model has a systematic bias or shortcut, the other two will disagree on problems where that shortcut fails. The consensus naturally filters out single-model gaming.
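To make the table concrete, here is a minimal consensus-weighted loss in PyTorch. The weight values are placeholders I picked for illustration, not the ones used in the CGRT experiments, and a real language-model run would weight per-token losses rather than this classification-style version:

```python
import torch
import torch.nn.functional as F

# Placeholder weights per consensus level (illustrative, not tuned).
CONSENSUS_WEIGHTS = {"unanimous": 1.0, "majority": 0.5, "split": 0.1}

def consensus_weighted_loss(logits, targets, levels):
    """Cross-entropy where each example is scaled by its consensus weight.

    logits: (batch, num_classes), targets: (batch,),
    levels: list of consensus-level strings, one per example.
    """
    per_example = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.tensor(
        [CONSENSUS_WEIGHTS[lvl] for lvl in levels],
        device=per_example.device,
    )
    return (per_example * weights).mean()
```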

The Dataset We’re Releasing

The dataset will be released on HuggingFace at Adam1010/gsm8k-multi-model-consensus.

This is, to my knowledge, the first public dataset with:

  • Multiple frontier model responses per problem
  • Agreement/consensus metadata
  • Chain-of-thought reasoning from each model
  • Difficulty calibration based on model disagreement

Total generation cost: ~$850 in API calls. As a solo, self-taught developer, I know that’s a real barrier–it certainly was for me. I believe this data has value, and by sharing it freely, I hope it helps others like me who want to explore these ideas without burning through their API budget first.
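Once the dataset is live, loading it should look roughly like this. The split name and field names below are assumptions on my part until the dataset card is final:

```python
from datasets import load_dataset

# Dataset ID from this post; the field names are guesses at the schema.
ds = load_dataset("Adam1010/gsm8k-multi-model-consensus", split="train")

example = ds[0]
print(example["question"])
for model in ("claude", "gpt5", "gemini"):
    print(model, example[f"{model}_answer"])
print("consensus:", example["consensus_level"])
```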

Early Experimental Results

Preliminary CGRT experiments training a small model (125M parameters) using consensus-weighted loss:

| Metric | Result |
| --- | --- |
| Discrimination accuracy | 100% (reliably distinguishes good from bad reasoning) |
| Margin improvement | +0.226 (stronger preference for correct answers) |
| Generalization | Better performance on held-out disagreement cases |

The model isn’t just memorizing–it’s learning something closer to actual mathematical reasoning.

What’s Next

The approach generalizes beyond math:

  • Code generation: Do multiple models produce code that passes the same tests? (a functional-agreement sketch follows this list)
  • Factual questions: Do models agree on verifiable facts?
  • Reasoning chains: Do independent models reach conclusions through similar logical steps?
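For the code-generation case, "agreement" can be defined functionally: two models agree when their generated programs pass the same tests. A rough sketch, with hypothetical helper names:

```python
def functional_consensus(candidates, tests):
    """Group model-written functions by which shared tests they pass.

    candidates: model name -> callable; tests: list of (args, expected).
    Models that pass exactly the same subset of tests count as agreeing.
    (Illustrative only; real generated code would run sandboxed.)
    """
    groups = {}
    for name, fn in candidates.items():
        passed = tuple(
            i for i, (args, expected) in enumerate(tests)
            if _call_quietly(fn, args) == expected
        )
        groups.setdefault(passed, []).append(name)
    return groups

def _call_quietly(fn, args):
    """Run a candidate, treating any exception as a failed test."""
    try:
        return fn(*args)
    except Exception:
        return None

# Usage: functional_consensus({"claude": f1, "gpt5": f2, "gemini": f3},
#                             [((2, 3), 5), ((10, -1), 9)])
```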

The core idea–using multi-model consensus as a training signal–could fundamentally change how we develop AI systems that are genuinely capable rather than merely benchmark-optimized.

Why This Matters

We’re at an inflection point in AI development. Models are getting more powerful, but we’re not always sure they’re getting more reliable. The gap between benchmark performance and real-world capability is a growing concern.

Consensus-guided training is one path toward closing that gap. By grounding our training in agreement among diverse, independently developed models, we create a more robust signal of genuine understanding.

It’s not a silver bullet. But it may be a step toward AI systems we can actually trust.


The GSM8K Multi-Model Consensus dataset will be available at huggingface.co/datasets/Adam1010/gsm8k-multi-model-consensus once generation completes.