Thirty-Six Models on a Laptop
Running local AI on Apple Silicon in rural Sweden — what works, what does not, and where the ceiling is
There is a MacBook in a village in northern Sweden with thirty-six AI models cached on disk. Not API endpoints. Not cloud services. Actual model weights, sitting on an SSD, running inference on a single Apple Silicon chip with 32 gigabytes of unified memory. The village has unreliable internet. The models do not care.
This is what local AI looks like in early 2026: a laptop that can run everything from a 3-billion-parameter speed demon at 63 tokens per second to a 32-billion-parameter reasoning model at 6.5 tokens per second. The quality ranges from “surprisingly good” to “not good enough.” The interesting part is knowing which is which, and for what.
The speed-quality curve
The fundamental tradeoff with local models is the same one that governs everything in computing: you can have it fast, or you can have it good. With local inference on consumer hardware, the relationship is brutally direct.
At the small end, 3B parameter models (the small Qwen and Llama variants) run at 20-60 tokens per second. This is fast enough to feel interactive: responses appear in real time, streaming feels natural, and batch processing of hundreds of items is practical. The quality is adequate for classification, extraction, and simple summarization. It falls apart on complex reasoning, nuanced language tasks, and anything requiring genuine understanding rather than pattern matching.
At the large end, 27B-32B parameter models (Gemma-3-27B, Qwen2.5-32B) run at 6-8 tokens per second. This is usably fast for single queries but too slow for batch work. The quality is genuinely good — Gemma-3-27B produces literary-quality Swedish prose, and Qwen2.5-32B is the only local model that reliably outputs clean JSON without markdown wrapping or schema violations.
In between, there is nothing. The 7-8B models at 15-25 tokens per second occupy a middle ground that is neither fast enough for batch nor good enough for quality-critical work. They exist, they work, but the decision framework almost never lands on them: if speed matters, drop to 3B; if quality matters, go to 27B+ or call a cloud API.
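The tokens-per-second figures above are easy to reproduce. Here is a minimal sketch using the mlx_lm package, assuming a cached mlx-community model; the model ID is illustrative, not one of the article's exact test models, and generate()'s keyword arguments vary slightly across mlx_lm versions:

```python
# Rough tokens-per-second benchmark for a local MLX model.
# Assumes `pip install mlx-lm`; the model ID below is an illustrative
# example from the mlx-community Hugging Face organization.
import time

from mlx_lm import load, generate

MODEL_ID = "mlx-community/Qwen2.5-7B-Instruct-4bit"  # swap in any cached model
PROMPT = "Summarize the tradeoffs of running language models locally."
MAX_TOKENS = 256

model, tokenizer = load(MODEL_ID)

start = time.perf_counter()
text = generate(model, tokenizer, prompt=PROMPT, max_tokens=MAX_TOKENS)
elapsed = time.perf_counter() - start

# Token count via the tokenizer; close enough for a rough throughput figure.
n_tokens = len(tokenizer.encode(text))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```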
What works locally
OCR and document processing. Apple Silicon handles document-related inference at zero API cost. Once a tool like Azure Document Intelligence (a cloud service) has done the initial OCR pass, local models can classify, extract, and summarize the document content without touching the internet again. For a newspaper operation processing hundreds of pages per issue, this is not a nicety: it is the difference between a viable workflow and a prohibitively expensive one.
Classification with context. One of the more counterintuitive findings from the lab: local models are bad at classification from first principles but good at classification with a reference glossary. Haiku — a cloud model — went from 50% accuracy to 80% with an 870-token glossary injected into the prompt. The same pattern works locally. Qwen2.5-7B with domain context approaches the accuracy of much larger cloud models. The context does the heavy lifting; the model just needs to be good enough to read it.
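The mechanics are plain prompt assembly; nothing model-specific is required. A sketch with a hypothetical glossary and label set, both placeholders:

```python
# Classification with an injected reference glossary: the glossary carries
# the domain knowledge, the model only has to apply it. All entries and
# labels here are illustrative placeholders.
GLOSSARY = """\
invoice: a billing document listing items, amounts, and payment terms
contract: a signed agreement between two or more parties
minutes: a written record of what was discussed at a meeting
"""

LABELS = ["invoice", "contract", "minutes", "other"]

def build_prompt(document_text: str) -> str:
    return (
        "You are a document classifier.\n"
        f"Reference glossary:\n{GLOSSARY}\n"
        f"Classify the document into exactly one of: {', '.join(LABELS)}.\n"
        "Answer with the label only.\n\n"
        f"Document:\n{document_text}"
    )
```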
Structured output (Qwen2.5-32B only). If you need JSON output from a local model, Qwen2.5-32B is currently the only reliable option. Every other model tested wraps JSON in markdown code fences, adds explanatory text around the JSON, or violates the schema in creative ways. Qwen2.5-32B produces clean, parseable JSON consistently. This single capability makes it indispensable for any local pipeline that feeds into automated processing.
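Even with Qwen2.5-32B in the loop, a defensive parser is cheap insurance for the pipelines that other models feed. A sketch of the fence-stripping fallback this implies:

```python
import json
import re

# Matches a JSON object wrapped in triple-backtick code fences,
# with or without a "json" language tag.
FENCE_RE = re.compile(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", re.DOTALL)

def parse_model_json(raw: str) -> dict:
    """Parse JSON from model output, tolerating markdown code fences.

    Qwen2.5-32B usually returns bare JSON; other models tend to wrap it
    in code fences or surround it with explanatory prose.
    """
    # Happy path: the whole response is valid JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # Fall back to the first fenced block, then to the first brace span.
    fenced = FENCE_RE.search(raw)
    candidate = fenced.group(1) if fenced else None
    if candidate is None:
        braced = re.search(r"\{.*\}", raw, re.DOTALL)
        candidate = braced.group(0) if braced else None
    if candidate is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate)
```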
Vision tasks (with caveats). Local vision models (Qwen2.5-VL, Llama-3.2-Vision) can describe images, read text from screenshots, and perform basic visual analysis. The critical gotcha: you must set num_images=1 in the configuration or the model allocates memory for multiple images and crashes. The quality is adequate for “what is in this image” tasks and poor for anything requiring fine-grained detail or precise text extraction.
Batch processing. Any task where you need to run the same operation across hundreds or thousands of items — and the quality threshold is “good enough, not perfect” — belongs on local inference. The cost is zero (electricity aside), the throughput scales linearly with how many items you have, and there are no rate limits, no API keys, and no internet dependency.
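The shape of a batch job follows from this: load the model once, then loop as long as you like. A sketch with mlx_lm; the model ID, items, and prompt wording are placeholders:

```python
# Batch classification over a list of items with a single locally loaded
# model. The point is the shape: load once, loop freely, no rate limits,
# no API keys, no network.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

items = ["first document text...", "second document text..."]  # hundreds in practice
results = []
for item in items:
    prompt = (
        "Classify this document as invoice, contract, minutes, or other. "
        f"Answer with the label only.\n\n{item}"
    )
    results.append(generate(model, tokenizer, prompt=prompt, max_tokens=16))
```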
What does not work locally
Swedish prose generation. With one exception, local models produce Swedish text that ranges from grammatically plausible to subtly wrong. The exception is Gemma-3-27B-qat (Google’s quantization-aware trained variant), which produces literary-quality Swedish at 7 tokens per second. Every other model in the 7-32B range generates text that a native speaker would identify as “written by someone who learned Swedish from a textbook.” The gap between Gemma and everything else for Swedish is large enough that it is effectively a single-model capability.
Complex reasoning. Tasks that require multi-step logical inference, careful weighing of evidence, or maintaining consistency across long outputs exceed what local models can do reliably. This is not a tuning problem — it is a parameter count problem. The models are too small to hold the representational complexity that these tasks require. Cloud models in the 70B-400B+ range handle these tasks because they have the capacity. Local models at 3-32B do not.
Reliable instruction following on long prompts. As prompt length increases, local models' ability to follow every instruction degrades faster than cloud models' does. A 3B model given a 2000-token prompt with five formatting constraints will reliably follow three of them; which three varies unpredictably. This makes local models unreliable for any task where the full set of instructions must be followed: editorial review with a style guide, structured output with many fields, or multi-constraint generation.
The Qwen3 thinking tag problem
One gotcha deserves special mention because it affects both local and cloud deployments of the same model family.
Qwen3 models — across sizes, across providers — leak chain-of-thought reasoning tags into their output. The model’s internal thinking process, wrapped in <think> tags, appears in the response text. This is not an MLX artifact; it happens on Groq’s hosted Qwen3 as well. The tags contain the model’s reasoning steps, which are sometimes interesting but always unwanted in production output.
The fix is post-processing: strip everything between <think> and </think> tags before using the output. This works but adds complexity to every pipeline that uses a Qwen3 model, and the tags occasionally appear in malformed variations that simple regex does not catch. It is a systematic issue with the model family, not a deployment configuration problem.
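A sketch of such a post-processing pass; the handling of malformed variants here is one plausible approach, not an exhaustive one:

```python
import re

# Strip Qwen3 chain-of-thought tags from a response. The well-formed case
# is <think>...</think>; unclosed openers and orphaned closers need
# separate handling, which is why a single regex is not enough.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think_tags(text: str) -> str:
    text = THINK_BLOCK.sub("", text)
    # Unclosed opening tag: drop everything from <think> to end of text.
    text = re.sub(r"<think>.*", "", text, flags=re.DOTALL)
    # Orphaned closing tag: reasoning leaked before it; keep what follows.
    if "</think>" in text:
        text = text.split("</think>")[-1]
    return text.strip()
```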
The decision framework
After testing 36 models across dozens of task types, the decision of when to use local versus cloud reduces to three questions:
Is the task batch or interactive? Batch tasks with flexible quality thresholds belong locally. Interactive tasks where response quality is immediately visible to a user belong in the cloud.
Does the output feed a human or a machine? Human-facing output needs the quality that only larger models provide. Machine-facing output (JSON for a pipeline, classifications for a database, extractions for further processing) often works fine from local models, especially with context injection.
Is the internet available and reliable? This sounds trivial, but in practice it is not. Rural infrastructure means the internet is sometimes unavailable for hours. Local models are the only option during outages, and knowing which tasks they handle well enough to keep working is the difference between lost time and continued productivity.
The fallback chain that emerged from this work:
Cloud API (Haiku/Sonnet): best quality, costs money, needs internet
  ↓ on API failure or internet outage
Groq free tier: nearly as good, free, needs internet
  ↓ on Groq outage or rate limiting
Local MLX (Qwen2.5-7B or Gemma-3-27B): good enough, free, no internet needed
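In code, the chain is just ordered attempts with fall-through on failure. A sketch in which all three backends are hypothetical placeholders for the real Anthropic, Groq, and MLX calls:

```python
# Fallback chain: try providers in quality order, fall through on failure.
# All three call_* functions are placeholders; real code would wrap the
# Anthropic SDK, the Groq SDK, and a local mlx_lm generate() call.
def call_cloud(prompt: str) -> str:
    raise ConnectionError("placeholder: simulates a cloud API failure")

def call_groq(prompt: str) -> str:
    raise ConnectionError("placeholder: simulates a rate-limited Groq call")

def call_local(prompt: str) -> str:
    return f"[local MLX answer to: {prompt!r}]"  # always available

def complete(prompt: str) -> str:
    for backend in (call_cloud, call_groq, call_local):
        try:
            return backend(prompt)
        except Exception:  # network error, rate limit, provider outage
            continue
    raise RuntimeError("all backends failed, including local")

print(complete("Classify this document."))  # falls through to local
```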
The quality delta between the top and bottom of this chain is 0.4-0.8 points on a 10-point scale, as measured across standardized tasks. That is small enough that most end users would not notice the difference. It is large enough that quality-critical tasks — final editorial passes, user-facing content, anything with your name on it — should use the cloud tier when available.
The inventory problem
Thirty-six cached models consume disk space. The current inventory uses roughly 180 gigabytes. Not all of these models are necessary — many were downloaded for testing, benchmarked once, and never used again.
The cleanup question is always the same: which models can be deleted without losing a capability? The answer requires maintaining a mental map of which model is the best at each task. Delete Qwen2.5-32B and you lose the only reliable local JSON producer. Delete Gemma-3-27B-qat and you lose the only good local Swedish writer. Delete everything in the 7-8B range and you lose… nothing you cannot get from either the 3B models (for speed) or the 27B+ models (for quality).
A recent cleanup freed 44 gigabytes by removing models that were either redundant (multiple Llama variants doing the same thing) or outclassed (Ollama models superseded by faster MLX equivalents). The principle: keep exactly one model per capability niche. Two models that are both “pretty good at classification” is one model too many.
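The audit itself is mechanical. A sketch that lists cached models by disk footprint, assuming they live in the default Hugging Face hub cache (mlx-community downloads land there; Ollama keeps a separate store under ~/.ollama):

```python
# List cached Hugging Face models by disk footprint, largest first.
# Assumes the default hub cache location.
from pathlib import Path

CACHE = Path.home() / ".cache" / "huggingface" / "hub"

def dir_size(path: Path) -> int:
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

sizes = sorted(
    ((dir_size(d), d.name) for d in CACHE.iterdir() if d.is_dir()),
    reverse=True,
)
for size, name in sizes:
    print(f"{size / 1e9:6.1f} GB  {name}")
```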
What changes next
The local model landscape moves fast. New quantization techniques compress larger models into the memory budget of smaller ones. New architectures trade parameter count for inference efficiency. Apple ships new hardware annually that shifts the speed-quality curve.
The framework survives these changes even if the specific model recommendations do not. The questions remain the same: batch or interactive, human or machine, internet or no internet. The answers will shift as 27B models reach the speeds that 7B models have today, or as 70B models fit into 32 gigabytes of memory. When that happens, some tasks that currently require cloud inference will migrate to local. The cloud tier will retreat to the tasks that only the largest models can handle.
Until then: thirty-six models on a laptop, six to eight of them genuinely useful, the rest benchmarks waiting to be deleted.