Fallback Quality Delta: How Much Do You Lose When the Primary Model is Down?
The operating assumption was right — but now with data
Date: 2026-02-23
Cost: ~$0.005 (judge calls only — all test inference was $0)
Hypothesis
Production applications use fallback chains — if the primary cloud model fails, fall back to a local model or a free-tier alternative. But the quality delta has never been measured. The operating assumption is “the fallback is the baseline that was already acceptable.” Is that actually true?
Method
- Phase 1: Run the same 4 standardized tasks through 6 models:
  - Cloud primary: Haiku 4.5, Sonnet 4.6, Opus 4.6 (reference ceiling)
  - Local fallback: Qwen2.5-7B MLX, Llama-3.1-8B MLX (the application’s actual fallbacks)
  - Free cloud: GPT-OSS-120B via Cerebras (tested as a potential middle tier)
- Phase 2: AI judge (Llama 4 Maverick, Azure) blind-scores all outputs 1-10 on accuracy, completeness, structure, usefulness, and language quality
- Phase 3: Calculate deltas per fallback pair and produce verdict tables (a minimal pipeline sketch follows the task list below)
Tasks tested:
- Extraction: key facts from an interview transcript
- Editorial: review of a draft column
- Summary: 3-4 sentence summary in a non-English language
- Generation: write a 150-200 word editorial in a non-English language
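A minimal sketch of the three-phase harness, assuming hypothetical `run_model` and `judge_output` helpers (the real clients aren't shown; every identifier here is illustrative, not the actual evaluation code):

```python
# Sketch of the three-phase pipeline. run_model / judge_output are
# placeholders for the real inference and judge clients.

MODELS = [
    "opus-4.6", "haiku-4.5", "sonnet-4.6",   # cloud
    "gpt-oss-120b-cerebras",                 # free cloud
    "qwen2.5-7b-mlx", "llama-3.1-8b-mlx",    # local
]
TASKS = ["extraction", "editorial", "summary", "generation"]

def run_model(model: str, task: str) -> str:
    """Placeholder: route to the right API or local MLX runtime."""
    raise NotImplementedError

def judge_output(text: str) -> float:
    """Placeholder: single-pass blind scoring call (see methodology notes)."""
    raise NotImplementedError

def evaluate() -> dict[tuple[str, str], float]:
    # Phase 1: same 4 tasks through all 6 models.
    outputs = {(m, t): run_model(m, t) for m in MODELS for t in TASKS}
    # Phase 2: each output judged independently, scored 1-10.
    return {key: judge_output(text) for key, text in outputs.items()}

def delta(scores, primary: str, fallback: str, task: str) -> float:
    # Phase 3: fallback minus primary; negative means quality loss.
    return scores[(fallback, task)] - scores[(primary, task)]
```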
Results
Model Averages
| Model | Avg Overall | Avg Latency | Cost/call | Role |
|---|---|---|---|---|
| Opus 4.6 | 9.0 | 16.1s | $0.044 | Quality ceiling |
| Haiku 4.5 | 8.8 | 5.1s | $0.002 | Application primary (metadata) |
| Sonnet 4.6 | 8.8 | 11.9s | $0.008 | Application primary (backvoice) |
| GPT-OSS-120B (Cerebras) | 8.8 | 0.8s | $0.000 | New middle tier |
| Qwen2.5-7B MLX | 8.4 | 16.9s | $0.000 | Application fallback |
| Llama-3.1-8B MLX | 8.2 | 17.8s | $0.000 | Application fallback (backvoice) |
Task-Level Scores
| Model | Extraction | Editorial | Summary | Generation |
|---|---|---|---|---|
| Opus 4.6 | 9 | 9 | 9 | 9 |
| Haiku 4.5 | 9 | 9 | 8 | 9 |
| Sonnet 4.6 | 9 | 9 | 8 | 9 |
| GPT-OSS-120B (Cerebras) | 8 | 9 | 9 | 9 |
| Qwen2.5-7B MLX | 8.5 | 9 | 9 | 7 |
| Llama-3.1-8B MLX | 9 | 8 | 8 | 8 |
Fallback Delta Verdicts
Haiku 4.5 to Qwen2.5-7B MLX (actual metadata fallback)
| Task | Primary | Fallback | Delta | Verdict |
|---|---|---|---|---|
| Extraction | 9 | 8.5 | -0.5 | Acceptable |
| Editorial | 9 | 9 | 0 | Acceptable |
| Summary | 8 | 9 | +1 | Acceptable (fallback wins!) |
| Generation | 9 | 7 | -2 | Noticeable |
Bottom line: 3/4 tasks fine. Non-English generation is the weak spot — Qwen2.5-7B produces grammatically awkward output when writing from scratch. The application doesn’t generate non-English text from scratch (it cleans/summarizes existing text), so in practice this fallback is fine for its actual use case.
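The verdict labels in these tables map mechanically onto the deltas. A minimal sketch of that mapping, with cutoffs inferred from the scores rather than formally specified:

```python
# Verdict cutoffs inferred from the tables (assumption: |delta| <= 1 reads
# as acceptable, a 2-point drop as noticeable; wording of the parenthetical
# varies by table).
def verdict(delta: float) -> str:
    if delta > 0:
        return "Acceptable (fallback wins!)"
    if delta >= -1:
        return "Acceptable"
    if delta >= -2:
        return "Noticeable"
    return "Unacceptable"  # never hit in this run
```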
Haiku 4.5 to Llama-3.1-8B MLX (actual backvoice fallback)
| Task | Primary | Fallback | Delta | Verdict |
|---|---|---|---|---|
| Extraction | 9 | 9 | 0 | Acceptable |
| Editorial | 9 | 8 | -1 | Acceptable |
| Summary | 8 | 8 | 0 | Acceptable |
| Generation | 9 | 8 | -1 | Acceptable |
Bottom line: All acceptable. Max delta is -1. Llama-3.1-8B is a more reliable fallback than Qwen2.5-7B for language-sensitive tasks.
Haiku 4.5 to GPT-OSS-120B Cerebras (potential replacement)
| Task | Primary | Fallback | Delta | Verdict |
|---|---|---|---|---|
| Extraction | 9 | 8 | -1 | Acceptable |
| Editorial | 9 | 9 | 0 | Acceptable |
| Summary | 8 | 9 | +1 | Acceptable (beats Haiku!) |
| Generation | 9 | 9 | 0 | Acceptable |
Bottom line: Cerebras GPT-OSS-120B matches Haiku’s quality at $0 cost and roughly 6x the speed. It could replace Haiku as the primary tier for non-critical tasks.
Sonnet 4.6 to GPT-OSS-120B Cerebras (potential Sonnet replacement)
| Task | Primary | Fallback | Delta | Verdict |
|---|---|---|---|---|
| Extraction | 9 | 8 | -1 | Acceptable |
| Editorial | 9 | 9 | 0 | Acceptable |
| Summary | 8 | 9 | +1 | Acceptable (beats Sonnet!) |
| Generation | 9 | 9 | 0 | Acceptable |
Bottom line: Identical to the Haiku comparison. For plumbing and metadata tasks, GPT-OSS-120B is a viable Sonnet replacement, saving $0.008/call.
Key Findings
1. The Fallback is Genuinely Acceptable
The operating assumption was right — but now with data. The local MLX models score 8.2-8.4 vs cloud’s 8.8-9.0. That’s a 0.4-0.8 point gap on a 10-point scale. For most tasks, the end user won’t notice the difference.
2. Qwen2.5-7B Has a Non-English Generation Weakness
It scored 7/10 on non-English generation — the only sub-8 score in the entire run. Qwen2.5-7B produces grammatically awkward output when writing from scratch: fine for cleaning and summarizing existing text, not for original prose. This aligns with previous findings that Mistral-Small-24B, not Qwen, is the best local model for non-English generation.
3. Cerebras GPT-OSS-120B is a Game Changer for Fallback Chains
It tied with Haiku and Sonnet at an 8.8 average. $0 cost. 0.8s latency (vs Haiku’s 5.1s). The free tier (1M tokens/day) is enough for most project workloads.
Suggested new fallback chain:

    Primary: Haiku/Sonnet ($0.002-0.008, 5-12s)
      | API failure
      v
    Tier 2: Cerebras GPT-OSS-120B ($0, 0.8s)   <-- NEW
      | Cerebras down
      v
    Tier 3: Local MLX Qwen2.5-7B/Llama-3.1-8B ($0, 7-22s)
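A minimal sketch of how such a chain can be wired, assuming each tier is exposed as a plain callable that raises on failure; the tier function names in the usage comment are hypothetical:

```python
from typing import Callable

def with_fallback(prompt: str, tiers: list[Callable[[str], str]]) -> str:
    """Try each tier in order; any exception drops to the next one."""
    last_err: Exception | None = None
    for call in tiers:
        try:
            return call(prompt)
        except Exception as err:  # API failure, Cerebras down, etc.
            last_err = err
    raise RuntimeError("all fallback tiers failed") from last_err

# Hypothetical usage, mirroring the chain above:
# text = with_fallback(prompt, [call_haiku, call_cerebras_gpt_oss, call_local_mlx])
```

Keeping the tiers as a plain list makes reordering them (say, promoting Cerebras to primary for non-critical tasks) a one-line change.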
4. AI Judges are Decent but Generous
All Claude models scored 8-9, all fallback models scored 7-9. The spread is narrow. Maverick doesn’t punish the local models as harshly as a human might for language quality or structural elegance. Previous evaluation attempts found the same pattern: AI judges evaluate correctness well but are generous on style.
Judge Methodology Notes
- Used Llama 4 Maverick (Azure) — fast, cheap (~$0.0003/call), good at structured scoring
- Single-pass blind scoring (no A/B comparison) — each output judged independently
- 5 dimensions + overall holistic score
- Total judge cost: ~$0.005 for all 24 evaluations
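For the curious, a sketch of what one scoring call can look like, filling in the `judge_output` placeholder from the pipeline sketch above. It assumes an OpenAI-compatible endpoint for the Azure-hosted Maverick deployment; the rubric wording, endpoint URL, and deployment name are all assumptions, not the actual judge setup:

```python
# Sketch of a single-pass blind scoring call. Endpoint, deployment name,
# and rubric wording are assumptions, not the actual judge prompt.
import json
from openai import OpenAI

DIMENSIONS = ["accuracy", "completeness", "structure",
              "usefulness", "language quality"]

RUBRIC = (
    "Score the following output from 1-10 on each of: "
    + ", ".join(DIMENSIONS)
    + ", plus an 'overall' holistic score. Reply with JSON only."
)

client = OpenAI(
    base_url="https://<azure-endpoint>/v1",  # hypothetical endpoint
    api_key="<key>",
)

def judge_output(text: str) -> dict:
    resp = client.chat.completions.create(
        model="llama-4-maverick",  # deployment name is an assumption
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": text},
        ],
        temperature=0,  # deterministic scoring
    )
    # Assumes the judge complies with the "JSON only" instruction.
    return json.loads(resp.choices[0].message.content)
```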