Snoar Does Not Rhyme With Snolar
Azure is forty-five times cheaper than ElevenLabs, local models all fail, and the speed wobble is real
Making a computer speak English is solved. Natural voices, convincing prosody, free to cheap. Making a computer speak Swedish involves unexpected price ratios, pronunciation traps no documentation warns about, and local models that sound like they learned Swedish from an airport phrasebook.
I tested every viable TTS option for Swedish — cloud, local, and commercial. The landscape is narrower than expected.
Cloud
Three worth considering. Not interchangeable.
Azure is the production workhorse. Neural voices — Sofie and Hillevi — produce natural Swedish with correct prosody. The cost: roughly $0.0003 per segment. A six-hour audio production costs about $5.86. Generating speech for an entire podcast series costs less than the electricity to run the laptop that edits it.
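A back-of-envelope check of those numbers (Azure bills neural TTS per character; the $16-per-million-characters rate and the 150 words-per-minute, 6 characters-per-word figures below are my assumptions, not from the article):

```python
def azure_tts_cost(chars: int, usd_per_million_chars: float = 16.0) -> float:
    """Azure bills neural TTS per character; the rate here is an assumed list price."""
    return chars * usd_per_million_chars / 1_000_000

# Six hours of audio at ~150 words/min and ~6 chars/word (both rough guesses):
chars = 6 * 60 * 150 * 6  # 324,000 characters
print(f"${azure_tts_cost(chars):.2f} for six hours")  # lands near the article's $5.86
```

Whatever the exact segment length, the order of magnitude holds: regenerating audio on every iteration is effectively free.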
OpenAI produces arguably better output — more emotional range, better handling of complex sentences. 5.6 times the price. For dozens of episodes, that multiplier moves from rounding error to line item. The gap has narrowed with Azure’s latest neural updates.
ElevenLabs is premium — voice cloning, fine-grained emotion, highest naturalness. Also the most expensive by far. The voice cloning is genuinely impressive: a few minutes of sample audio produces output casual listeners can’t distinguish from the original. Whether that’s a feature or a concern depends on your use case.
Local: the wasteland
Every local model tested for Swedish failed. Not “lower quality.” Failed.
Kokoro — most promising open-source TTS — handles English beautifully. Its Swedish sounds like an English speaker reading Swedish words phonetically. Vowel lengths wrong, pitch accent absent (the tonal difference between “anden” the duck and “anden” the spirit), compound words stressed wrong.
The gap between local and cloud isn’t the gradual 0.4-point delta you see in text generation. It’s binary: cloud voices sound like someone speaking Swedish, local voices sound like a machine reading a Swedish dictionary. For Swedish TTS, cloud isn’t a preference — it’s a requirement.
The gotchas
Even the best cloud voices make mistakes no documentation warns about.
Azure pronounces “snoar” (snows) as “snolar.” No native speaker would do this. Place names are worse — Swedish geography is full of names whose pronunciation has no relationship to their spelling, and TTS engines guess. The guesses are usually wrong.
The fix is SSML — explicit phoneme control for specific words. You maintain a pronunciation dictionary of known failures. It grows with every new text.
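A minimal sketch of such a dictionary, using SSML's `<phoneme>` element with IPA. The IPA strings are illustrative approximations (build yours by listening, not guessing), and `sv-SE-SofieNeural` is Azure's published name for the Sofie voice mentioned above:

```python
import re

# Known-bad words mapped to IPA overrides. Entries are illustrative.
PRONUNCIATIONS = {
    "snoar": "ˈsnuːar",  # Azure otherwise says "snolar"
    "Ystad": "ˈyːsta",   # place names are frequent offenders
}

def apply_phonemes(text: str) -> str:
    """Wrap known trouble words in SSML <phoneme> tags."""
    for word, ipa in PRONUNCIATIONS.items():
        pattern = r"\b" + re.escape(word) + r"\b"
        repl = f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'
        text = re.sub(pattern, repl, text)
    return text

def to_ssml(text: str, voice: str = "sv-SE-SofieNeural") -> str:
    """Build a complete SSML document for the synthesis request."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="sv-SE">'
        f'<voice name="{voice}">{apply_phonemes(text)}</voice>'
        '</speak>'
    )

print(to_ssml("Det snoar i Ystad."))
```

The dictionary-as-data design matters: every newly discovered failure becomes one line in `PRONUNCIATIONS`, not a code change.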
A related discovery: Azure’s Swedish voices have a speed wobble. Certain sentences cause slight acceleration or deceleration that sounds unnatural. Setting rate to 95% of default fixes most of it. Not documented anywhere. Found by listening to hundreds of generated segments.
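The rate fix slots into the same SSML via `<prosody>`, which accepts a multiplier of the default speed. The 0.95 value is the article's empirically found setting; it may need tuning per voice:

```python
def with_rate(ssml_body: str, rate: float = 0.95) -> str:
    """Wrap text in a prosody element to damp the speed wobble."""
    return f'<prosody rate="{rate}">{ssml_body}</prosody>'

print(with_rate("Det snöar i Skåne idag."))
# <prosody rate="0.95">Det snöar i Skåne idag.</prosody>
```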
The decision
For Swedish TTS in 2026:
Production audio: Azure neural voices. Negligible cost, broadcast quality, pronunciation gotchas manageable with an SSML dictionary. OpenAI as a second pass for emotional segments.
Prototype and testing: Azure again. Cost is low enough to generate test audio every iteration.
Voice cloning: ElevenLabs, if budget allows.
Local or offline: Don’t. If internet is unavailable, defer. Local Swedish TTS undermines the content it reads.
The computers are learning to speak Swedish. Not fluent yet. But at forty-five times cheaper than premium, Azure is fluent enough — as long as you keep a pronunciation dictionary handy and don’t ask it to say “snoar.”