experiment 2026-02-23

Fallback Quality Delta: How Much Do You Lose When the Primary Model is Down?

The operating assumption was right, and now there is data to back it.

Date: 2026-02-23
Cost: ~$0.005 (judge calls only; all test inference was $0)

Hypothesis

Production applications use fallback chains: if the primary cloud model fails, fall back to a local model or a free-tier alternative. But the quality delta had never been measured here. The operating assumption is “the fallback is the baseline that was already acceptable.” Is that actually true?

Method

  1. Phase 1: Run the same 4 standardized tasks through 5 models (a harness sketch follows the task list below):

    • Cloud primary: Haiku 4.5, Sonnet 4.6, Opus 4.6 (reference ceiling)
    • Local fallback: Qwen2.5-7B MLX, Llama-3.1-8B MLX (the application’s actual fallbacks)
    • Free cloud: GPT-OSS-120B via Cerebras (tested as potential middle tier)
  2. Phase 2: AI judge (Llama 4 Maverick, Azure) blind-scores all outputs 1-10 on five dimensions: accuracy, completeness, structure, usefulness, language quality

  3. Phase 3: Calculate the delta for each fallback pair and produce a verdict table

Tasks tested:

  • Extraction: key facts from an interview transcript
  • Editorial: review of a draft column
  • Summary: 3-4 sentence summary in a non-English language
  • Generation: write a 150-200 word editorial in a non-English language
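
For concreteness, here is a minimal sketch of what the Phase 1 harness can look like. It assumes a single hypothetical `complete(model, prompt)` helper wrapping the various clients (Anthropic API, Cerebras, local MLX); the model IDs and prompt stubs are illustrative, not the actual harness code.

```python
# Phase 1 sketch (assumed harness, not the actual code): run every task
# through every model, keeping the raw output and wall-clock latency so the
# outputs can be judged blind later.
import itertools
import time

MODELS = [
    "opus-4.6", "haiku-4.5", "sonnet-4.6",  # cloud reference + primaries
    "gpt-oss-120b-cerebras",                # free cloud candidate
    "qwen2.5-7b-mlx", "llama-3.1-8b-mlx",   # local fallbacks
]
TASKS = {
    "extraction": "Extract the key facts from this interview transcript: ...",
    "editorial": "Review this draft column: ...",
    "summary": "Summarize this text in 3-4 sentences: ...",
    "generation": "Write a 150-200 word editorial on: ...",
}

def run_matrix(complete):
    """Collect one output per (model, task) pair: 6 models x 4 tasks = 24."""
    results = {}
    for model, (task, prompt) in itertools.product(MODELS, TASKS.items()):
        start = time.perf_counter()
        text = complete(model, prompt)  # hypothetical unified client helper
        results[(model, task)] = {
            "output": text,
            "latency_s": time.perf_counter() - start,
        }
    return results
```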

Results

Model Averages

| Model | Avg Overall | Avg Latency | Cost/call | Role |
|---|---|---|---|---|
| Opus 4.6 | 9.0 | 16.1s | $0.044 | Quality ceiling |
| Haiku 4.5 | 8.8 | 5.1s | $0.002 | Application primary (metadata) |
| Sonnet 4.6 | 8.8 | 11.9s | $0.008 | Application primary (backvoice) |
| GPT-OSS-120B (Cerebras) | 8.8 | 0.8s | $0.000 | New middle tier |
| Qwen2.5-7B MLX | 8.4 | 16.9s | $0.000 | Application fallback |
| Llama-3.1-8B MLX | 8.2 | 17.8s | $0.000 | Application fallback (backvoice) |

Task-Level Scores

| Model | Extraction | Editorial | Summary | Generation |
|---|---|---|---|---|
| Opus 4.6 | 9 | 9 | 9 | 9 |
| Haiku 4.5 | 9 | 9 | 8 | 9 |
| Sonnet 4.6 | 9 | 9 | 8 | 9 |
| GPT-OSS-120B (Cerebras) | 8 | 9 | 9 | 9 |
| Qwen2.5-7B MLX | 8.5 | 9 | 9 | 7 |
| Llama-3.1-8B MLX | 9 | 8 | 8 | 8 |

Fallback Delta Verdicts
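
A note on reading the tables: delta is the fallback score minus the primary score. The verdict cutoffs below are inferred from the tables themselves (deltas within one point read as “Acceptable”, a two-point drop as “Noticeable”); the exact rubric isn’t spelled out, so treat this sketch’s thresholds as an assumption.

```python
# Verdict logic inferred from the tables below; the cutoffs are assumptions.
def verdict(primary: float, fallback: float) -> str:
    delta = fallback - primary
    if delta <= -2:
        return "Noticeable"
    if delta > 0:
        return "Acceptable (fallback wins!)"
    return "Acceptable"

# Example with the Haiku 4.5 -> Qwen2.5-7B pair from the first table:
haiku = {"extraction": 9, "editorial": 9, "summary": 8, "generation": 9}
qwen = {"extraction": 8.5, "editorial": 9, "summary": 9, "generation": 7}
for task in haiku:
    print(task, qwen[task] - haiku[task], verdict(haiku[task], qwen[task]))
# Only "generation" comes out Noticeable (-2); the rest are Acceptable.
```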

Haiku 4.5 to Qwen2.5-7B MLX (actual metadata fallback)

| Task | Primary | Fallback | Delta | Verdict |
|---|---|---|---|---|
| Extraction | 9 | 8.5 | -0.5 | Acceptable |
| Editorial | 9 | 9 | 0 | Acceptable |
| Summary | 8 | 9 | +1 | Acceptable (fallback wins!) |
| Generation | 9 | 7 | -2 | Noticeable |

Bottom line: 3/4 tasks fine. Non-English generation is the weak spot — Qwen2.5-7B produces grammatically awkward output when writing from scratch. The application doesn’t generate non-English text from scratch (it cleans/summarizes existing text), so in practice this fallback is fine for its actual use case.

Haiku 4.5 to Llama-3.1-8B MLX (actual backvoice fallback)

| Task | Primary | Fallback | Delta | Verdict |
|---|---|---|---|---|
| Extraction | 9 | 9 | 0 | Acceptable |
| Editorial | 9 | 8 | -1 | Acceptable |
| Summary | 8 | 8 | 0 | Acceptable |
| Generation | 9 | 8 | -1 | Acceptable |

Bottom line: All acceptable. Max delta is -1. Llama-3.1-8B is a more reliable fallback than Qwen2.5-7B for language-sensitive tasks.

Haiku 4.5 to GPT-OSS-120B Cerebras (potential replacement)

| Task | Primary | Fallback | Delta | Verdict |
|---|---|---|---|---|
| Extraction | 9 | 8 | -1 | Acceptable |
| Editorial | 9 | 9 | 0 | Acceptable |
| Summary | 8 | 9 | +1 | Acceptable (beats Haiku!) |
| Generation | 9 | 9 | 0 | Acceptable |

Bottom line: Cerebras GPT-OSS-120B matches Haiku quality at $0 cost and 6x speed. Could replace Haiku as the primary tier for non-critical tasks.

Sonnet 4.6 to GPT-OSS-120B Cerebras (potential Sonnet replacement)

| Task | Primary | Fallback | Delta | Verdict |
|---|---|---|---|---|
| Extraction | 9 | 8 | -1 | Acceptable |
| Editorial | 9 | 9 | 0 | Acceptable |
| Summary | 8 | 9 | +1 | Acceptable (beats Sonnet!) |
| Generation | 9 | 9 | 0 | Acceptable |

Bottom line: Identical to the Haiku comparison. For plumbing and metadata tasks, GPT-OSS-120B is a viable Sonnet replacement, saving $0.008/call.

Key Findings

1. The Fallback is Genuinely Acceptable

The operating assumption was right, and now there is data to back it. The local MLX models score 8.2-8.4 vs the cloud models’ 8.8-9.0. That’s a 0.4-0.8 point gap on a 10-point scale. For most tasks, the end user won’t notice the difference.

2. Qwen2.5-7B Has a Non-English Generation Weakness

Scored 7/10 on generation in a non-English language — the only model below 8 on any task. It produces grammatically awkward output when writing from scratch. Fine for cleaning and summarizing existing text, not for original prose. This aligns with previous findings that Mistral-Small-24B is the best local model for non-English generation, not Qwen.

3. Cerebras GPT-OSS-120B is a Game Changer for Fallback Chains

Tied with Haiku and Sonnet at 8.8 average. $0 cost. 0.8s latency (vs Haiku’s 5.1s). The free tier (1M tokens/day) is enough for most project workloads.
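
(Rough arithmetic, assuming a ballpark of ~1,000 tokens per call: 1M tokens/day works out to on the order of 1,000 free calls per day. The exact figure depends on prompt and output length.)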

Suggested new fallback chain:

Primary: Haiku/Sonnet ($0.002-0.008, 5-12s)
  | API failure
Tier 2: Cerebras GPT-OSS-120B ($0, 0.8s)  <-- NEW
  | Cerebras down
Tier 3: Local MLX Qwen2.5-7B/Llama-3.1-8B ($0, 7-22s)
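
A minimal sketch of that chain in code, assuming each tier is wrapped in a callable that raises on failure; the client functions named in the commented example are hypothetical stand-ins, not the application’s actual clients.

```python
# Walk the tiers top to bottom and return the first successful completion.
# Any exception (network error, HTTP 429, timeout) triggers the next tier.
from typing import Callable

def with_fallbacks(tiers: list[tuple[str, Callable[[str], str]]], prompt: str) -> str:
    last_error: Exception | None = None
    for name, call in tiers:
        try:
            return call(prompt)
        except Exception as exc:
            last_error = exc
            print(f"tier {name!r} failed ({exc}); falling back")
    raise RuntimeError("all tiers failed") from last_error

# chain = [
#     ("haiku-4.5", anthropic_complete),       # primary: paid, 5-12s
#     ("gpt-oss-120b", cerebras_complete),     # tier 2: free, ~0.8s  <-- NEW
#     ("qwen2.5-7b-mlx", local_mlx_complete),  # tier 3: local, slow but always up
# ]
# answer = with_fallbacks(chain, "Extract the key facts from ...")
```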

4. AI Judges are Decent but Generous

All Claude models scored 8-9, all fallback models scored 7-9. The spread is narrow. Maverick doesn’t punish the local models as harshly as a human might for language quality or structural elegance. Previous evaluation attempts found the same pattern: AI judges evaluate correctness well but are generous on style.

Judge Methodology Notes

  • Used Llama 4 Maverick (Azure) — fast, cheap (~$0.0003/call), good at structured scoring
  • Single-pass blind scoring (no A/B comparison) — each output judged independently
  • 5 dimensions + overall holistic score
  • Total judge cost: ~$0.005 for all 24 evaluations
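
For reference, a sketch of what a single-pass blind judging call can look like. The dimension names come from this post; the prompt wording and the `call_maverick(prompt)` helper for the Azure endpoint are assumptions for illustration.

```python
# Single-pass blind judge sketch: the judge sees only the task and the output,
# never the model name, and each output is scored independently (no A/B).
import json

DIMENSIONS = ["accuracy", "completeness", "structure", "usefulness", "language_quality"]

JUDGE_TEMPLATE = """You are scoring one model output without knowing which model wrote it.
Score each dimension from 1 to 10, plus an overall holistic score.
Reply with JSON only, e.g. {{{dims}, "overall": <1-10>}}

Task: {task}

Output to score:
{output}"""

def judge(call_maverick, task: str, output: str) -> dict:
    dims = ", ".join(f'"{d}": <1-10>' for d in DIMENSIONS)
    prompt = JUDGE_TEMPLATE.format(dims=dims, task=task, output=output)
    scores = json.loads(call_maverick(prompt))  # hypothetical Azure helper
    missing = set(DIMENSIONS + ["overall"]) - scores.keys()
    if missing:
        raise ValueError(f"judge response missing scores: {missing}")
    return scores
```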