experiment 2026-02-23

Fallback Quality Delta: How Much Do You Lose When the Primary Model is Down?

The operating assumption was right, and now there is data to back it.

Date: 2026-02-23
Cost: ~$0.005 (judge calls only; all test inference was $0)

Hypothesis

Production applications use fallback chains: if the primary cloud model fails, fall back to a local model or a free-tier alternative. But the quality delta had never been measured here. The operating assumption is “the fallback is the baseline that was already acceptable.” Is that actually true?

Method

  1. Phase 1: Run the same 4 standardized tasks through 5 models (a harness sketch follows the task list below):

    • Cloud primary: Haiku 4.5, Sonnet 4.6, Opus 4.6 (reference ceiling)
    • Local fallback: Qwen2.5-7B MLX, Llama-3.1-8B MLX (the application’s actual fallbacks)
    • Free cloud: GPT-OSS-120B via Cerebras (tested as potential middle tier)
  2. Phase 2: AI judge (Llama 4 Maverick, Azure) blind-scores all outputs 1-10 on five dimensions: accuracy, completeness, structure, usefulness, language quality

  3. Phase 3: Calculate the delta for each fallback pair and produce a verdict table

Tasks tested:

  • Extraction: key facts from an interview transcript
  • Editorial: review of a draft column
  • Summary: 3-4 sentence summary in a non-English language
  • Generation: write a 150-200 word editorial in a non-English language
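
For concreteness, here is a minimal sketch of what the Phase 1 harness can look like. It assumes a single hypothetical `complete(model, prompt)` helper wrapping the various clients (Anthropic API, Cerebras, local MLX); the model IDs and prompt stubs are illustrative, not the actual harness code.

```python
# Phase 1 sketch (assumed harness, not the actual code): run every task
# through every model, keeping the raw output and wall-clock latency so the
# outputs can be judged blind later.
import itertools
import time

MODELS = [
    "opus-4.6", "haiku-4.5", "sonnet-4.6",  # cloud reference + primaries
    "gpt-oss-120b-cerebras",                # free cloud candidate
    "qwen2.5-7b-mlx", "llama-3.1-8b-mlx",   # local fallbacks
]
TASKS = {
    "extraction": "Extract the key facts from this interview transcript: ...",
    "editorial": "Review this draft column: ...",
    "summary": "Summarize this text in 3-4 sentences: ...",
    "generation": "Write a 150-200 word editorial on: ...",
}

def run_matrix(complete):
    """Collect one output per (model, task) pair: 6 models x 4 tasks = 24."""
    results = {}
    for model, (task, prompt) in itertools.product(MODELS, TASKS.items()):
        start = time.perf_counter()
        text = complete(model, prompt)  # hypothetical unified client helper
        results[(model, task)] = {
            "output": text,
            "latency_s": time.perf_counter() - start,
        }
    return results
```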

Results

Model Averages

| Model | Avg Overall | Avg Latency | Cost/call | Role |
|---|---|---|---|---|
| Opus 4.6 | 9.0 | 16.1s | $0.044 | Quality ceiling |
| Haiku 4.5 | 8.8 | 5.1s | $0.002 | Application primary (metadata) |
| Sonnet 4.6 | 8.8 | 11.9s | $0.008 | Application primary (backvoice) |
| GPT-OSS-120B (Cerebras) | 8.8 | 0.8s | $0.000 | New middle tier |
| Qwen2.5-7B MLX | 8.4 | 16.9s | $0.000 | Application fallback |
| Llama-3.1-8B MLX | 8.2 | 17.8s | $0.000 | Application fallback (backvoice) |

Task-Level Scores

| Model | Extraction | Editorial | Summary | Generation |
|---|---|---|---|---|
| Opus 4.6 | 9 | 9 | 9 | 9 |
| Haiku 4.5 | 9 | 9 | 8 | 9 |
| Sonnet 4.6 | 9 | 9 | 8 | 9 |
| GPT-OSS-120B (Cerebras) | 8 | 9 | 9 | 9 |
| Qwen2.5-7B MLX | 8.5 | 9 | 9 | 7 |
| Llama-3.1-8B MLX | 9 | 8 | 8 | 8 |

Fallback Delta Verdicts
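
A note on reading the tables: delta is the fallback score minus the primary score. The verdict cutoffs below are inferred from the tables themselves (deltas within one point read as “Acceptable”, a two-point drop as “Noticeable”); the exact rubric isn’t spelled out, so treat this sketch’s thresholds as an assumption.

```python
# Verdict logic inferred from the tables below; the cutoffs are assumptions.
def verdict(primary: float, fallback: float) -> str:
    delta = fallback - primary
    if delta <= -2:
        return "Noticeable"
    if delta > 0:
        return "Acceptable (fallback wins!)"
    return "Acceptable"

# Example with the Haiku 4.5 -> Qwen2.5-7B pair from the first table:
haiku = {"extraction": 9, "editorial": 9, "summary": 8, "generation": 9}
qwen = {"extraction": 8.5, "editorial": 9, "summary": 9, "generation": 7}
for task in haiku:
    print(task, qwen[task] - haiku[task], verdict(haiku[task], qwen[task]))
# Only "generation" comes out Noticeable (-2); the rest are Acceptable.
```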

Haiku 4.5 to Qwen2.5-7B MLX (actual metadata fallback)

| Task | Primary | Fallback | Delta | Verdict |
|---|---|---|---|---|
| Extraction | 9 | 8.5 | -0.5 | Acceptable |
| Editorial | 9 | 9 | 0 | Acceptable |
| Summary | 8 | 9 | +1 | Acceptable (fallback wins!) |
| Generation | 9 | 7 | -2 | Noticeable |

Bottom line: 3/4 tasks fine. Non-English generation is the weak spot — Qwen2.5-7B produces grammatically awkward output when writing from scratch. The application doesn’t generate non-English text from scratch (it cleans/summarizes existing text), so in practice this fallback is fine for its actual use case.

Haiku 4.5 to Llama-3.1-8B MLX (actual backvoice fallback)

| Task | Primary | Fallback | Delta | Verdict |
|---|---|---|---|---|
| Extraction | 9 | 9 | 0 | Acceptable |
| Editorial | 9 | 8 | -1 | Acceptable |
| Summary | 8 | 8 | 0 | Acceptable |
| Generation | 9 | 8 | -1 | Acceptable |

Bottom line: All acceptable. Max delta is -1. Llama-3.1-8B is a more reliable fallback than Qwen2.5-7B for language-sensitive tasks.

Haiku 4.5 to GPT-OSS-120B Cerebras (potential replacement)

| Task | Primary | Fallback | Delta | Verdict |
|---|---|---|---|---|
| Extraction | 9 | 8 | -1 | Acceptable |
| Editorial | 9 | 9 | 0 | Acceptable |
| Summary | 8 | 9 | +1 | Acceptable (beats Haiku!) |
| Generation | 9 | 9 | 0 | Acceptable |

Bottom line: Cerebras GPT-OSS-120B matches Haiku quality at $0 cost and 6x speed. Could replace Haiku as the primary tier for non-critical tasks.

Sonnet 4.6 to GPT-OSS-120B Cerebras (potential Sonnet replacement)

| Task | Primary | Fallback | Delta | Verdict |
|---|---|---|---|---|
| Extraction | 9 | 8 | -1 | Acceptable |
| Editorial | 9 | 9 | 0 | Acceptable |
| Summary | 8 | 9 | +1 | Acceptable (beats Sonnet!) |
| Generation | 9 | 9 | 0 | Acceptable |

Bottom line: Identical to the Haiku comparison. For plumbing and metadata tasks, GPT-OSS-120B is a viable Sonnet replacement, saving $0.008/call.

Key Findings

1. The Fallback is Genuinely Acceptable

The operating assumption was right, and now there is data to back it. The local MLX models score 8.2-8.4 vs the cloud models’ 8.8-9.0. That’s a 0.4-0.8 point gap on a 10-point scale. For most tasks, the end user won’t notice the difference.

2. Qwen2.5-7B Has a Non-English Generation Weakness

Scored 7/10 on generation in a non-English language — the only model below 8 on any task. It produces grammatically awkward output when writing from scratch. Fine for cleaning and summarizing existing text, not for original prose. This aligns with previous findings that Mistral-Small-24B is the best local model for non-English generation, not Qwen.

3. Cerebras GPT-OSS-120B is a Game Changer for Fallback Chains

Tied with Haiku and Sonnet at 8.8 average. $0 cost. 0.8s latency (vs Haiku’s 5.1s). The free tier (1M tokens/day) is enough for most project workloads.
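
(Rough arithmetic, assuming a ballpark of ~1,000 tokens per call: 1M tokens/day works out to on the order of 1,000 free calls per day. The exact figure depends on prompt and output length.)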

Suggested new fallback chain:

Primary: Haiku/Sonnet ($0.002-0.008, 5-12s)
  | API failure
Tier 2: Cerebras GPT-OSS-120B ($0, 0.8s)  <-- NEW
  | Cerebras down
Tier 3: Local MLX Qwen2.5-7B/Llama-3.1-8B ($0, 7-22s)
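
A minimal sketch of that chain in code, assuming each tier is wrapped in a callable that raises on failure; the client functions named in the commented example are hypothetical stand-ins, not the application’s actual clients.

```python
# Walk the tiers top to bottom and return the first successful completion.
# Any exception (network error, HTTP 429, timeout) triggers the next tier.
from typing import Callable

def with_fallbacks(tiers: list[tuple[str, Callable[[str], str]]], prompt: str) -> str:
    last_error: Exception | None = None
    for name, call in tiers:
        try:
            return call(prompt)
        except Exception as exc:
            last_error = exc
            print(f"tier {name!r} failed ({exc}); falling back")
    raise RuntimeError("all tiers failed") from last_error

# chain = [
#     ("haiku-4.5", anthropic_complete),       # primary: paid, 5-12s
#     ("gpt-oss-120b", cerebras_complete),     # tier 2: free, ~0.8s  <-- NEW
#     ("qwen2.5-7b-mlx", local_mlx_complete),  # tier 3: local, slow but always up
# ]
# answer = with_fallbacks(chain, "Extract the key facts from ...")
```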

4. AI Judges are Decent but Generous

All Claude models scored 8-9, all fallback models scored 7-9. The spread is narrow. Maverick doesn’t punish the local models as harshly as a human might for language quality or structural elegance. Previous evaluation attempts found the same pattern: AI judges evaluate correctness well but are generous on style.

Judge Methodology Notes

  • Used Llama 4 Maverick (Azure) — fast, cheap (~$0.0003/call), good at structured scoring
  • Single-pass blind scoring (no A/B comparison) — each output judged independently
  • 5 dimensions + overall holistic score
  • Total judge cost: ~$0.005 for all 24 evaluations
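
For reference, a sketch of what a single-pass blind judging call can look like. The dimension names come from this post; the prompt wording and the `call_maverick(prompt)` helper for the Azure endpoint are assumptions for illustration.

```python
# Single-pass blind judge sketch: the judge sees only the task and the output,
# never the model name, and each output is scored independently (no A/B).
import json

DIMENSIONS = ["accuracy", "completeness", "structure", "usefulness", "language_quality"]

JUDGE_TEMPLATE = """You are scoring one model output without knowing which model wrote it.
Score each dimension from 1 to 10, plus an overall holistic score.
Reply with JSON only, e.g. {{{dims}, "overall": <1-10>}}

Task: {task}

Output to score:
{output}"""

def judge(call_maverick, task: str, output: str) -> dict:
    dims = ", ".join(f'"{d}": <1-10>' for d in DIMENSIONS)
    prompt = JUDGE_TEMPLATE.format(dims=dims, task=task, output=output)
    scores = json.loads(call_maverick(prompt))  # hypothetical Azure helper
    missing = set(DIMENSIONS + ["overall"]) - scores.keys()
    if missing:
        raise ValueError(f"judge response missing scores: {missing}")
    return scores
```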