Fine-Tuning the Hard Way: Two Failures and What They Taught Us
Two attempts. Two failures. Same root cause both times.
You have free cloud credits expiring in two weeks. The platform offers fine-tuning for a dozen open-weight models. You have a corpus of Swedish text sitting on disk, already OCR’d, already in roughly the right format. The question writes itself: can we train a model that writes like a Swedish newspaper?
This is the story of what went wrong, what it cost, and what we learned about when fine-tuning makes sense — and when it’s just expensive procrastination.
Attempt 1: 19th-Century Literature
The idea was simple. We had 19,000 examples of Swedish literary text — Almqvist, Wagner, other 19th-century authors — extracted from digitized books via OCR. The base model was Phi-4-mini, a compact 3.8 billion parameter model. The hypothesis: fine-tune it on historical Swedish prose and get a model that could generate text in that register.
The result was strictly worse than the base model on every dimension. Coherence, style fidelity, prompt-following — all degraded. The fine-tuned model didn’t learn to write like Almqvist. It learned to reproduce OCR artifacts and output random text fragments.
Three things went wrong simultaneously.
Dirty data. OCR on 19th-century printed Swedish is an ugly business. Fraktur typefaces, yellowed pages, inconsistent typesetting — the OCR pipeline produced text littered with Cyrillic character substitutions, broken words, and encoding noise. We ran a cleaning pipeline, but the artifacts that survived were exactly the kind of subtle corruption that a small model will faithfully memorize. When your training data teaches the model that “кönig” is how you spell “konung,” the model believes you.
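A cheap first line of defense we skipped: scanning every training example for characters outside the expected script before it reaches the training set. A minimal sketch, not our actual pipeline (the function name and threshold are illustrative):

```python
import unicodedata

def looks_corrupted(text: str, max_foreign_ratio: float = 0.01) -> bool:
    """Flag text whose letters fall outside the Latin script, e.g. the
    Cyrillic substitutions an OCR pipeline can produce. Swedish prose,
    including letters like a-ring and o-umlaut, is entirely Latin-script."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return True  # empty or non-textual chunk: reject it
    foreign = sum(
        1 for c in letters
        if not unicodedata.name(c, "").startswith("LATIN")
    )
    return foreign / len(letters) > max_foreign_ratio
```

Anything this flags gets a human or model look before training. At 19,000 examples, even a 1% surviving artifact rate is 190 poisoned examples.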
Mismatched prompt-response pairs. We paired random writing prompts with arbitrary text chunks. The training format said “here’s a prompt, here’s what you should generate,” but the actual content had no relationship between the two. The model learned the only consistent pattern in the data: ignore whatever the prompt says and continue outputting random text. This is instruction-tuning’s version of garbage in, garbage out — you’re not just corrupting the output, you’re actively training the model to disregard instructions.
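The fix is structural: derive the prompt from the same passage as the completion, so every pair is consistent by construction. A sketch, with an illustrative Swedish instruction string and an arbitrary split ratio:

```python
def make_continuation_pair(passage: str, split_ratio: float = 0.3) -> dict:
    """Build a self-consistent training pair from a single passage:
    the prompt is the opening of the text, the completion is its
    genuine continuation, not a randomly paired chunk."""
    cut = max(1, int(len(passage) * split_ratio))
    return {
        "prompt": "Fortsätt texten i samma stil: " + passage[:cut],
        "completion": passage[cut:],
    }
```

With pairs built this way, the only consistent pattern in the data is the one you want the model to learn: read the prompt, then continue it.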
Quantity over quality. 19,000 noisy examples is a lot of training data for a 3.8-billion-parameter model to absorb. With clean, well-structured data, that volume could meaningfully shift the model’s behavior. With noisy data, it overwhelmed the model’s existing instruction-following ability. The fine-tuned model didn’t add Swedish literary style on top of Phi-4-mini’s capabilities — it replaced those capabilities with noise.
The whole thing took about a day to set up and run. The training cost was negligible. But the result was unambiguous: the fine-tuned model was worse than just prompting the base model in Swedish.
Attempt 2: Newspaper Articles
A month later, different project, same itch. This time we had a corpus of articles from a Swedish newspaper — 782 examples extracted from OCR’d PDFs of recent issues. The base model was Ministral-3B, Mistral’s compact offering. The hypothesis was more modest: train a model that could generate text in the style of a specific Swedish local newspaper.
We ran three training configurations — 3 epochs, 5 epochs, and 10 epochs — with slightly different learning rates. All three completed successfully. The 3-epoch model was the only one we managed to evaluate before events overtook us (more on that shortly), and the result was… almost promising. It generated Swedish text. Real Swedish words in grammatically plausible sequences. But within a few sentences, every generation collapsed into repetition loops — the same phrase or sentence fragment repeating until it hit the token limit.
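Repetition collapse is easy to detect mechanically, which means it belongs in an automated evaluation harness rather than an eyeball pass. A rough check (window size and threshold are arbitrary starting points, not tuned values):

```python
def has_repetition_loop(text: str, ngram: int = 6, threshold: int = 3) -> bool:
    """Return True if any word n-gram occurs `threshold` or more times,
    a crude proxy for the 'same phrase until the token limit' failure."""
    words = text.split()
    counts: dict = {}
    for i in range(len(words) - ngram + 1):
        key = tuple(words[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] >= threshold:
            return True
    return False
```

Run it over a batch of sample generations after each training run; a rising loop rate is a red flag long before any human notices the style is off.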
What went wrong this time was both different and identical.
42% ad contamination. When you OCR a newspaper, you get everything on the page — not just the editorial content. Classified ads, display advertisements, event listings, subscription offers. Our extraction pipeline used heuristic filters to separate editorial from advertising, but heuristics only go so far. Nearly half the training examples contained advertising content mixed with or mistaken for editorial text. The model learned that newspaper writing involves repeating short phrases with slight variations — because that’s what ads do.
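The contamination rate itself is cheap to estimate before training, even with crude heuristics. A sketch; the markers below (price words, sale language, phone-number shapes) are illustrative, not the heuristics our pipeline actually used:

```python
import re

# Illustrative Swedish ad markers: price words, sale language, phone numbers.
AD_PATTERN = re.compile(
    r"\bpris\b|\bkr\b|\brea\b|\bring\b|\berbjudande\b|\d{2,4}[- ]\d{5,}",
    re.IGNORECASE,
)

def looks_like_ad(chunk: str, min_hits: int = 2) -> bool:
    """Flag a chunk as advertising if it trips at least `min_hits` markers."""
    return len(AD_PATTERN.findall(chunk)) >= min_hits

def contamination_rate(chunks: list) -> float:
    """Share of chunks flagged as ads: the number to compute, and act on,
    before training rather than after."""
    if not chunks:
        return 0.0
    return sum(looks_like_ad(c) for c in chunks) / len(chunks)
```

Heuristics like this catch the obvious cases only; the 42% that slipped through ours is the argument for model-assisted review on top.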
Same root cause. Strip away the details and both attempts failed for the same reason: we trained on data we hadn’t properly validated. In attempt 1, the corruption was character-level (OCR artifacts). In attempt 2, it was content-level (wrong type of text). Both times, we assumed the data was clean enough because we’d run some automated filtering. Both times, “clean enough” wasn’t.
The three training runs all completed, each costing a few dollars. But here’s where the second failure compounded: we never properly evaluated runs 2 and 3. The cloud resource group was deleted before we got to it. Those models — which might have been better, might have been worse, might have shown us something useful about how epochs interact with noisy data — simply ceased to exist.
The Compounding Mistake
Both attempts shared an anti-pattern that’s worth naming explicitly: credit-burning as motivation.
“We have free cloud credits expiring soon” is a terrible reason to start a machine learning experiment. It creates urgency where none should exist. It substitutes “use the resource before it disappears” for “test a specific hypothesis with a clear evaluation plan.” Both of our fine-tuning attempts were conceived and executed under this pressure, and both times it led to the same shortcut: skipping data validation to get the training job submitted before the credits expired.
Cheap experiments are genuinely good. One of the best things about the current AI landscape is that you can fine-tune a 3-billion-parameter model for the cost of a coffee. But cheap and unplanned are different things. An experiment without a success metric, without validated data, without an evaluation plan — that’s not a cheap experiment, it’s waste at any price.
The irony is that the urgency was real but misplaced. The credits were expiring, yes. But the expensive part of fine-tuning isn’t the training.
The Cost Anatomy
Here’s something that surprised us, and that most fine-tuning tutorials don’t emphasize enough.
Fine-tuning itself is cheap. Each of our training runs cost between $2 and $6 for a 3B parameter model on 782 examples. Even the larger run on 19,000 examples with Phi-4-mini was in the same ballpark. At these prices, you could run dozens of experimental configurations for the cost of dinner.
The expensive part is hosting. Once you’ve fine-tuned a model on a cloud platform, you need a deployed endpoint to actually use it. Our endpoints cost roughly $0.80 per hour each. That doesn’t sound like much until you realize it’s running 24/7 — nearly $20 per day per endpoint. We had two endpoints running simultaneously for testing.
And the most expensive part of all is neither training nor hosting. It’s the $50 we spent on training and hosting models that we never properly evaluated. $20 in training costs across five runs, $30 in endpoint hosting for models we poked at briefly and then deleted. Not because the money was significant in absolute terms, but because we got zero usable information out of it. We didn’t learn whether more epochs helped. We didn’t learn whether the ad contamination was the binding constraint. We didn’t learn whether Ministral-3B was a better base than Phi-4-mini for Swedish text. We spent the money and came away with nothing but the lesson that we’d wasted it.
For comparison: regular serverless inference — the pay-per-token kind — is essentially free at our volumes. You’d need to generate millions of tokens to match one hour of endpoint hosting. The economics of fine-tuning are almost entirely about whether you can justify the hosting cost, not the training cost.
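The arithmetic is worth making explicit. The hosting price below comes from our bill; the serverless per-token price is an assumed ballpark, since it varies by model and platform:

```python
ENDPOINT_USD_PER_HOUR = 0.80      # what our dedicated endpoints cost
SERVERLESS_USD_PER_MTOK = 0.10    # ASSUMED price per million tokens

def breakeven_tokens_per_hour() -> float:
    """Serverless tokens you could buy for one hour of endpoint hosting."""
    return ENDPOINT_USD_PER_HOUR / SERVERLESS_USD_PER_MTOK * 1_000_000

def monthly_hosting_usd(endpoints: int = 1, hours: float = 24 * 30) -> float:
    """Cost of keeping dedicated endpoints running around the clock."""
    return ENDPOINT_USD_PER_HOUR * endpoints * hours
```

At these assumed prices, one hour of hosting buys roughly eight million serverless tokens, and a single always-on endpoint runs to several hundred dollars a month. That monthly figure, not the training bill, is what decides whether a fine-tune pays off.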
What We’d Require Before Fine-Tuning Again
After two failures with the same root cause, we wrote down five requirements. Not aspirational guidelines — hard prerequisites. No fine-tuning job gets submitted until all five are met.
1. Clean data first. Model-assisted cleaning, not just heuristic filters. Budget a full working session for data preparation before touching the training API. Use a capable language model to review and filter the training examples — it’s better at catching subtle quality issues than any regex pipeline. If you can’t afford the API calls to clean your data, you can’t afford to train on dirty data.
2. Defined success metric. “Generate Swedish text” is not a metric. Before training, answer: what specific task will the fine-tuned model perform, how will you evaluate the output, and what’s the baseline you’re comparing against? If the answer is “we’ll know it when we see it,” you’re not ready to fine-tune.
3. Smoke test at 50 examples. This was the cheapest insight from the whole experience. If a model shows repetition loops or degradation when fine-tuned on 50 clean examples, it will not improve at 800. Don’t scale up bad data. Run a tiny training job, evaluate it seriously, and only proceed if the results are directionally correct. At $2-6 per run, this costs almost nothing.
4. Download plan. Cloud fine-tunes on managed platforms often can’t be exported. The models are hosted as proxy endpoints — you can call them via API, but you can’t download the weights. When the resource group goes, the models go. We learned this the hard way when our second and third Ministral training runs were deleted before evaluation. If you want to keep the model, fine-tune on a platform where you own the weights — HuggingFace, a GPU rental service, or locally if your hardware supports it.
5. No credit-burning as motivation. This is the meta-requirement. If the primary motivation for a fine-tuning experiment is “use expiring credits,” stop and examine whether you actually have a hypothesis worth testing. Expiring credits are a fine opportunity — but the experiment needs to stand on its own merits. Urgency and rigor are not friends.
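Requirements 1 and 3 are the only ones that need code, and not much of it. A sketch of both; the LLM judge is left as an abstract callable, since the actual API call is platform-specific:

```python
import random

def smoke_test_sample(examples: list, n: int = 50, seed: int = 0) -> list:
    """Requirement 3: a small, reproducible sample for a cheap trial run.
    If fine-tuning on 50 clean examples degrades the model, 800 won't fix it."""
    rng = random.Random(seed)
    return rng.sample(examples, min(n, len(examples)))

def model_assisted_filter(examples: list, judge) -> list:
    """Requirement 1: `judge` is any callable returning True for a clean
    example, e.g. a wrapper that asks a strong LLM whether the text is
    editorial Swedish prose free of OCR noise and ad copy."""
    return [ex for ex in examples if judge(ex)]
```

The fixed seed matters: the smoke-test sample should be the same set every time, so that differences between runs come from the training configuration, not the data draw.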
The Broader Principle
Here’s what both failures were really teaching us, stated as plainly as possible: don’t fine-tune until your prompts are perfect.
This sounds like a platitude, but it has a precise technical meaning. Prompt engineering combined with context injection — giving the model examples, style guides, and reference text at inference time — gets you roughly 80% of the way to any specialized text generation task, at essentially zero cost. Every major model can generate competent Swedish prose if you prompt it well. Every model can mimic a newspaper’s voice if you give it examples in-context.
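Context injection is just prompt assembly. A minimal sketch, with illustrative Swedish framing text:

```python
def build_style_prompt(style_examples: list, task: str) -> str:
    """Inject genuine newspaper passages as in-context style references,
    then state the task: no training run, no endpoint, no data risk."""
    shots = "\n\n".join(
        f"Exempel {i + 1}:\n{ex}" for i, ex in enumerate(style_examples)
    )
    return (
        "Du skriver för en svensk lokaltidning. "
        "Härma tonen och stilen i exemplen nedan.\n\n"
        f"{shots}\n\nUppgift: {task}"
    )
```

The cost is tokens per request rather than dollars per hour, and swapping the examples swaps the style instantly, with no retraining.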
Fine-tuning is for the last 20%. It’s for when you’ve exhausted what prompting can do and you need the model to internalize a pattern so deeply that it doesn’t need to be reminded every time. That’s a real use case — latency-sensitive applications, cost optimization at scale, tasks where the context window is too precious to spend on style examples. But it’s a use case that only makes sense after you’ve proven that prompting alone isn’t sufficient.
We skipped that step both times. We went straight to fine-tuning because we had the data and the credits, not because we’d hit the ceiling of what prompting could do. In both cases, a well-prompted base model would have outperformed our fine-tuned versions — and for free.
The sequence should be: prompt engineering first, iterate until you’ve found the limits, document what prompting can’t achieve, and only then consider fine-tuning to close the gap. And when you do fine-tune, bring clean data, a clear metric, and a smoke test. Everything else is setting money on fire in an interesting way.
We’re not done with fine-tuning. The technique is sound, the costs are manageable, and there are genuine use cases waiting. But next time, we’ll come prepared. The data will be clean. The metric will be defined. And the motivation will be better than “the credits expire on Tuesday.”
This essay is based on experiments conducted in early 2026 using Azure AI Foundry’s fine-tuning infrastructure. The models, costs, and platform details reflect that moment in time — the landscape moves fast, but dirty data is eternal.