experiment 2026-02-23

How to Build a Multi-Model AI Judge Panel

A methodology for building a reusable, affordable multi-model judge panel with built-in bias detection

Date: 2026-02-23
Cost: ~$0.06 per evaluation round (5 judges)

The Problem

Single-model evaluation has well-known failure modes: score compression (everything clusters at 7-10), limited perspective (one model’s biases go undetected), and no way to detect when a judge favors outputs from its own provider. Previous evaluation attempts using 2-judge setups ran into all three.

A five-judge panel with explicit bias tracking, described below, addresses all three.

The Panel

5 diverse judges, chosen to cover different providers, architectures, and potential biases:

Judge              Provider          Cost/eval   Why This Judge
GPT-4o             OpenAI            ~$0.02      Proven neutral in previous evaluations
Llama 3.3 70B      Azure AI Foundry  ~$0.03      Open-source perspective
Claude Sonnet 4.6  Anthropic         ~$0.007     Bias detection: does Claude prefer Claude?
qwen3.5-plus       DashScope         ~$0.001     Non-Western perspective, free tier
GPT-OSS-120B       Cerebras          $0          Speed-silicon perspective, free

Total per round: ~$0.06 for 5 judges. For 3-run averaging: ~$0.18. This is only marginally more expensive than a 2-judge setup at ~$0.05.

Key Design Decisions

Use a 1-5 Scale, Not 1-10

A 10-point scale sounds more precise but produces score compression. Models cluster everything between 7 and 10, making it impossible to distinguish “good” from “great.” A 5-point scale forces real discrimination: 3 is meaningfully different from 4 in a way that 7 isn’t meaningfully different from 8.

Force JSON Output

Require structured scores with per-dimension justifications instead of free-text verdicts that need fragile parsing.

{
  "agents": {
    "A": {
      "task_completion": { "score": 4, "note": "5/6 tasks done" },
      "code_quality": { "score": 3, "note": "works but messy" }
    },
    "B": {
      "task_completion": { "score": 3, "note": "4/6 tasks done" },
      "code_quality": { "score": 3, "note": "cleaner but incomplete" }
    }
  },
  "ranking": ["A", "B"],
  "confidence": "high",
  "reasoning": "A's approach was superior..."
}
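Forced JSON is only useful if you reject off-schema responses before they enter the tally. A minimal validation sketch in Python (a hypothetical helper, not code from the original pipeline):

```python
import json

REQUIRED_KEYS = {"agents", "ranking", "confidence", "reasoning"}

def validate_judgment(raw: str) -> dict:
    """Parse a judge's raw output and reject anything off-schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for agent, dims in data["agents"].items():
        for dim, entry in dims.items():
            score = entry.get("score")
            # Enforce the 1-5 scale per dimension
            if not isinstance(score, int) or not 1 <= score <= 5:
                raise ValueError(f"{agent}/{dim}: score {score!r} outside 1-5")
    # The ranking must cover exactly the agents that were scored
    if sorted(data["ranking"]) != sorted(data["agents"]):
        raise ValueError("ranking must cover exactly the scored agents")
    return data
```

A verdict that fails any check is simply discarded and the judge re-queried, rather than patched up by hand.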

Track Bias Explicitly

With 5 judges from different providers, you can detect systematic bias. If Claude consistently scores Claude-generated outputs higher than other judges do, that’s a signal. If Llama consistently scores open-source outputs higher, that’s a signal. Two judges can’t give you this — five can.
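The self-preference check itself is a few lines, assuming each score is tagged with the judge's provider and the evaluated output's provider (a sketch with hypothetical names, not the original tooling):

```python
from collections import defaultdict
from statistics import mean

def self_preference(records):
    """records: (judge_provider, output_provider, score) tuples.

    Returns, per judge provider, its mean score on its own provider's
    outputs minus its mean score on everyone else's. Positive values
    suggest self-preference.
    """
    own, other = defaultdict(list), defaultdict(list)
    for judge, output, score in records:
        (own if judge == output else other)[judge].append(score)
    # Only judges with scores in both buckets can be compared
    return {j: mean(own[j]) - mean(other[j]) for j in own if j in other}
```

A stable nonzero offset across rounds is the signal described above; a one-off difference on a single comparison is just noise.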

Use Configurable Rubrics

Different evaluation types need different dimensions. Two examples:

AI agent comparison rubric:

  • task_completion 30%
  • code_quality 25%
  • ux_consideration 20%
  • problem_solving 15%
  • efficiency 10%

API swap / model comparison rubric:

  • output_quality 35%
  • language_quality 25%
  • format_compliance 20%
  • consistency 20%

Domain-specific rubrics (coaching messages, data extraction, metadata generation, narrative voice) follow the same pattern: weighted dimensions that sum to 100%, each scored 1-5.
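The weighted rollup is the same for every rubric. A sketch using the agent-comparison weights above (the function name is illustrative):

```python
AGENT_RUBRIC = {
    "task_completion": 0.30,
    "code_quality": 0.25,
    "ux_consideration": 0.20,
    "problem_solving": 0.15,
    "efficiency": 0.10,
}

def weighted_score(scores, rubric):
    """Collapse per-dimension 1-5 scores into one weighted 1-5 score."""
    assert abs(sum(rubric.values()) - 1.0) < 1e-9, "weights must sum to 100%"
    return sum(weight * scores[dim] for dim, weight in rubric.items())
```

Swapping evaluation types means swapping the rubric dict; the scoring code never changes.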

The Process

Phase 1: Smoke Test

Run a trivial comparison (two Fibonacci implementations) through all 5 judges. Verify all return valid JSON, use the full 1-5 range, and produce coherent rankings. This catches JSON parsing failures and models that refuse to score below 4.
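The smoke-test checks can be written as assertions over the five parsed verdicts (a sketch; `verdicts` stands in for whatever structure your harness collects, here matching the JSON schema shown earlier):

```python
def smoke_test(verdicts):
    """verdicts: one parsed JSON verdict per judge for a trivial comparison."""
    assert len(verdicts) == 5, "every judge must return parseable JSON"
    scores = [
        entry["score"]
        for v in verdicts
        for dims in v["agents"].values()
        for entry in dims.values()
    ]
    assert all(1 <= s <= 5 for s in scores), "scores must stay on the 1-5 scale"
    # A judge that never goes below 4 can't separate good from great
    assert min(scores) < 4, "panel shows score compression"
    # On a trivial comparison, every judge should rank the same way
    assert len({tuple(v["ranking"]) for v in verdicts}) == 1, "rankings incoherent"
```

Any failed assertion here is cheap to catch now and expensive to discover mid-experiment.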

Phase 2: Validate Against Known Results

Feed data from a previous evaluation with known outcomes through the new panel. Expected: the panel should agree with the known result while providing richer signal. If any judge shows systematic bias toward outputs from its own provider family, document it.
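One simple way to check agreement with the known outcome is a majority vote over each judge's top-ranked agent (a sketch; verdicts follow the JSON schema shown earlier):

```python
from collections import Counter

def panel_winner(verdicts):
    """Majority vote over each judge's top-ranked agent."""
    top_picks = Counter(v["ranking"][0] for v in verdicts)
    return top_picks.most_common(1)[0][0]
```

If `panel_winner` disagrees with the known result, that is the point where you dig into the per-judge scores rather than trusting the aggregate.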

Phase 3: Calibrate

If any judge consistently fails to parse JSON or compresses scores, adjust the system prompt or replace that judge. The panel should be stable before using it for real evaluations.
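A crude but serviceable compression check per judge, assuming you keep each judge's scores across calibration rounds (the threshold of 2 points is an assumption, not a rule from the original setup):

```python
def is_compressed(scores, min_spread=2):
    """Flag a judge whose 1-5 scores never span at least min_spread points."""
    return max(scores) - min(scores) < min_spread
```

A flagged judge gets a sharper system prompt first; if the spread doesn't improve, it gets replaced.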

Validation

The methodology was validated by reusing the panel across subsequent experiments: AI agent comparisons, model quality assessments, and editorial review pipelines. Key validation results:

  • All 5 judges reliably returned valid JSON with the forced schema
  • The 1-5 scale produced meaningful spread (scores ranging from 2 to 5, not clustering)
  • Bias detection surfaced real patterns (some judges were consistently 0.5 points more generous than others, but no provider self-preference was detected)
  • The ~$0.06/round cost made it practical to run 3x for averaging without budget concerns
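The generosity pattern in the third bullet falls out of a per-judge offset against the panel mean (a sketch; the dict shape is an assumption about how scores are collected):

```python
from statistics import mean

def generosity_offsets(scores_by_judge):
    """Each judge's mean score minus the panel-wide mean score."""
    panel_mean = mean(s for scores in scores_by_judge.values() for s in scores)
    return {judge: mean(scores) - panel_mean
            for judge, scores in scores_by_judge.items()}
```

Stable offsets like these are harmless as long as every output is scored by the same panel; they only bias results if judges are swapped mid-experiment.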

Lessons

  • 1-5 beats 1-10. The compression problem disappears when you halve the scale.
  • 5 judges at $0.06 beats 2 judges at $0.05. The marginal cost of three more judges is negligible; the signal improvement is substantial.
  • Bias detection is the underrated feature. Even if you never find bias, knowing you would have detected it changes how much you trust the results.
  • Rubric configuration is essential. A coding rubric applied to a writing task produces meaningless scores. The 30 seconds spent choosing dimensions pays for itself immediately.
  • The panel is reusable. Once calibrated, the same 5-judge setup works across different experiment types with only the rubric changing.