# Context Beats Compute: When a Glossary Beats a Better Model
An 870-token text file that cost nothing and took thirty minutes to write
A categorization tool sorts incoming ideas into seven custom categories. The cheap AI model dumps half of them into “miscellaneous.” The obvious fix: upgrade to a model four times the price. But the actual fix was an 870-token text file that cost nothing and took thirty minutes to write.
This is a story about the most common mistake in applied AI: reaching for a bigger model when the problem is a thinner prompt.
## The problem
The application takes messy, voice-transcribed ideas and assigns each one to one of seven categories. These aren’t generic labels like “work” or “personal” — they’re domain-specific buckets with names that only make sense if you already know the system. Think categories like “content pipeline” or “infrastructure” where the boundaries are obvious to the person who built them but opaque to a language model seeing them for the first time.
Four models were tested against twenty labeled ideas, using only the bare category names as guidance:
| Model | Category match | Failure mode |
|---|---|---|
| Sonnet | 80% | Reasonable disagreements |
| Maverick | 70% | One real error |
| Qwen 3.5+ | 60% | Dumps to misc |
| Haiku | 50% | Aggressively dumps to misc |
Haiku — the fastest and cheapest option — got half of them wrong, almost entirely by shoving ideas into the catch-all “miscellaneous” category. It didn’t misunderstand the ideas. It didn’t hallucinate categories. It just couldn’t map unfamiliar domain terms to unfamiliar category names, so it gave up and picked the safe default.
Sonnet, at roughly four times the cost per token, managed 80%. The naive conclusion: domain classification needs a big model. Pay more, get better results. Move on.
That conclusion was wrong.
## The actual fix
Instead of upgrading the model, the next step was writing a context file — a plain text glossary, 870 tokens long. It contained three things:
- Richer category descriptions. Two to three lines per category instead of one. Not creative prose — just enough specificity to distinguish “this belongs here” from “this could go anywhere.”
- A domain glossary. Twenty-five terms mapped to their correct categories. Things like project names, tool names, and abbreviations that appear constantly in the input but mean nothing without context.
- Transcription quirk notes. Voice-transcribed input garbles proper nouns in predictable ways. “11 laps” means ElevenLabs. A human knows this. A model without context does not.
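To make the shape of this concrete, here is a minimal sketch of what such a context file might look like. Everything below except “11 laps” → ElevenLabs is a hypothetical placeholder, not the actual file from this project:

```
## Categories
content pipeline — ideas about producing, editing, or scheduling published
  material. Includes scripts, thumbnails, and episode planning.
infrastructure — servers, deployments, and automations that keep the system
  running. How things run, not what to build next.

## Glossary (term → category)
ElevenLabs → content pipeline   (text-to-speech service)
deploy script → infrastructure

## Transcription quirks
"11 laps" usually means ElevenLabs.
```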
Total cost: zero dollars. Time investment: about thirty minutes of thinking through what a new team member would need to know to do this classification correctly. That framing — “what would you tell a human doing this job for the first time?” — turns out to be the right question every time.
## The results
The same four models, the same twenty ideas, now with the 870-token context file prepended to the prompt:
| Model + context | Accuracy (corrected ground truth) | Change |
|---|---|---|
| Sonnet + context | ~95% | Still best, but the gap shrank dramatically |
| Haiku + context | ~80% | From 50% to 80% — misc dumping eliminated |
| Maverick + context | ~80% | From 70% to 80% |
| Qwen + context | ~80% | From 60% to 80% |
Haiku went from worst to tied-for-second. The spread between the cheapest and most expensive model collapsed from thirty percentage points to fifteen. Three models that previously ranged from mediocre to useless all converged at the same accuracy level.
The context didn’t make the models smarter. It made the task easier. Every model already had the reasoning ability to do this classification — they just lacked the domain knowledge to apply it. The glossary bridged that gap for almost nothing.
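The mechanics are simple enough to sketch. Assuming a generic chat-style classification pipeline, the only change is reading the context file and prepending it to an otherwise unchanged prompt. Function, file, and category names here are illustrative, not the project’s actual code:

```python
from pathlib import Path

# Abbreviated category list for illustration.
CATEGORIES = ["content pipeline", "infrastructure", "miscellaneous"]

def build_classification_prompt(idea: str, context_file: str = "glossary.txt") -> str:
    """Prepend the domain-context file (if present) to the classification prompt."""
    path = Path(context_file)
    context = path.read_text() if path.exists() else ""
    category_list = "\n".join(f"- {c}" for c in CATEGORIES)
    return (
        f"{context}\n\n"
        "Classify the idea below into exactly one category.\n"
        f"Categories:\n{category_list}\n\n"
        f"Idea: {idea}\n"
        "Answer with the category name only."
    )
```

The same prompt then goes to whichever model you are testing; nothing else about the pipeline changes, which is what makes the comparison across models clean.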
## The bonus finding: the expensive model was wrong too
Here’s the twist that makes this more than an optimization story.
Four or five of the twenty “ground truth” labels were actually wrong. The original labels had been assigned by the expensive model (Sonnet) in an earlier run — without the context file. It had made the same domain-knowledge mistakes that the cheaper models made later.
With the context file, all four models agreed on the correct classification for those items, contradicting the stored labels. The context didn’t just help the cheap model catch up to the expensive one. It exposed errors that the expensive model had made when it was flying blind.
This is worth sitting with for a moment. The ground truth itself was contaminated by the same context gap. When you evaluate models against labels that were generated without sufficient context, you’re measuring which model best replicates the original model’s mistakes. Add context, and the models don’t just get better — they get more correct than your benchmark.
## When to use this pattern
This works best in a specific family of problems:
- Domain-specific classification where categories use terminology that isn’t self-explanatory. If your labels are “positive / negative / neutral,” you don’t need a glossary. If they’re “pipeline / infrastructure / content-ops,” you probably do.
- Voice transcription inputs where proper nouns, product names, and technical terms get garbled in predictable ways. A glossary of common transcription errors is cheap insurance.
- Personal tools where the categories were designed for one person’s mental model. What’s obvious to the builder is opaque to a model — and often to other humans.
- Any time before reaching for a bigger model. If a smaller model is failing, ask why before assuming it’s too dumb. “Not enough context” and “not enough capability” produce similar-looking failures but have very different fixes.
The thirty-minute test: write down what you’d tell a competent human doing this task for the first time. If that explanation is longer than what’s in your prompt, you’ve found your problem.
## When model size still matters
The context file is not a universal solvent. That remaining fifteen-point gap between Haiku and Sonnet is real, and there are tasks where it matters:
- Creative generation — voice, style, humor, the ability to produce something that sounds like it was written by a specific person. Smaller models plateau here regardless of context.
- Ambiguity reasoning — when the task requires judgment under genuine uncertainty, not just term mapping. The context file helps with “I don’t know what this word means.” It doesn’t help with “I can see two valid interpretations and need to pick the more likely one.”
- Long-tail edge cases — the context file handled the common domain terms. There will always be novel inputs that require the kind of broad reasoning that larger models do better. The question is whether those edge cases justify four times the cost for every request, or whether you handle them differently.
For classification specifically, the fifteen-point gap often doesn’t matter. Eighty percent accuracy at a quarter of the cost is the right trade for many applications, especially when the remaining twenty percent can be caught by a human review queue or a confidence threshold that escalates uncertain cases.
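The escalation idea above can be sketched as a simple router: accept confident cheap-model answers, and send the rest to a bigger model or a human review queue. The confidence field here is a hypothetical signal — a real pipeline might derive it from log-probabilities or from asking the model to self-rate:

```python
from dataclasses import dataclass

@dataclass
class Classification:
    category: str
    confidence: float  # 0.0–1.0, however your pipeline estimates it

def route(result: Classification, threshold: float = 0.7) -> str:
    """Accept confident cheap-model answers; escalate the uncertain ones."""
    # "miscellaneous" is the cheap model's known failure mode, so treat it
    # as uncertain regardless of the reported confidence.
    if result.category == "miscellaneous" or result.confidence < threshold:
        return "escalate"
    return "accept"
```

With this shape, the expensive model only sees the fraction of requests the cheap model can’t handle, rather than all of them.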
## The principle
The first round of testing produced a clean, intuitive conclusion: domain classification needs big models. The cheap models fail, the expensive models work, case closed.
The second round disproved it. Domain classification needs domain context. The cheap models weren’t failing because they lacked reasoning power — they were failing because they lacked information. An 870-token text file fixed it.
This is a general principle with broad application: before you upgrade the model, upgrade the prompt. Not with clever prompt engineering tricks or elaborate chain-of-thought scaffolding — just with the domain knowledge that the task actually requires. Write down what you know that the model doesn’t. It’s almost always cheaper than the alternative.
The cost ladder in AI tooling — from Haiku at fractions of a cent to Sonnet at several cents to Opus at tens of cents — is real and meaningful at scale. But the biggest cost savings don’t come from negotiating better rates or finding cheaper providers. They come from realizing that the expensive model was compensating for missing context, and that adding the context back is free.
An 870-token glossary. Thirty minutes of work. Zero dollars.
That’s the cheapest possible fix, and it should always be the first thing you try.