essay 2026-03-24

Agent Swarms Are Research Tools, Not Writing Tools

What a 7-episode production session taught us about where AI parallelism actually helps — and where it quietly fails.


We set out to produce seven podcast episodes in a single session. The plan was to throw AI agents at every phase of the pipeline: research, writing, review, rewriting, and audio generation. Fifteen exploration agents would fan out across a codebase, a newspaper archive, and the open web, hunting for material. Parallel writing agents would draft episodes simultaneously. Review agents would score and critique. Rewrite agents would fix what the reviewers flagged. Then text-to-speech would turn it all into about three hours of finished audio.

Most of this worked. Some of it worked spectacularly. But the session exposed a sharp line between what agent swarms are good at and what they are not. Swarms are extraordinary research tools. They are mediocre writing tools. That distinction matters for anyone trying to use AI agents for creative production, and getting it wrong costs money and degrades quality.

The experiment

The pipeline looked like this: pick topics from a series spec, launch parallel writing agents, run two review agents per episode (one for narrative quality, one for technical accuracy), compile findings, launch rewrite agents with specific fix lists, lint everything, generate audio, and push to a feed. The full cycle for seven episodes took one long session.

The interesting part was the front end — the exploration swarm. Before any writing happened, fifteen agents were dispatched to search for episode material. Each agent had a different territory: one searched the project’s own codebase and experiment history, another searched a digitized newspaper archive, others hit the web with research queries. They ran in parallel, in the background, and reported back with whatever they found.

What the swarms found

Six of the fifteen agents returned useful material. The other nine hit permission walls — a known infrastructure issue with how subagents inherit file access — and returned nothing. A sixty percent failure rate sounds bad, and it is bad operationally. But the six that worked surfaced material that no amount of directed thinking would have produced. That is the argument for swarms, right there: they find things you did not know to look for.

From a newspaper archive, one agent returned with a medieval Anatolian carpet found in a remote Swedish church — a trade route mystery spanning continents and centuries. Another found seventy years of continuous citizen science: a manual trout count maintained since 1950, making it one of Sweden’s longest-running ecological datasets. A third discovered a forgotten private radio network called WESTEL, a piece of analog shadow infrastructure that once threaded through rural communities and then vanished.

From the web, agents surfaced genuinely surprising material. Nobody has solved the bicycle stability problem — a 2011 paper in Science systematically debunked every existing theory for why bicycles balance, and the question remains open. A slime mold, given oat flakes placed at the locations of Tokyo rail stations, independently reproduced the city’s rail network. Braess’s paradox — the mathematically proven phenomenon in which adding a road can worsen traffic congestion, and closing one can relieve it — has been observed in Stuttgart, New York, and Seoul. And MOCAS, a 1958 COBOL system, still manages $1.3 trillion in Pentagon contracts.

From the codebase itself, agents turned up unexpected connections — a forgotten experiment file from an earlier project phase, an old creative writing fragment, a piece of archived research nobody remembered existed. None of these were things anyone would have searched for deliberately. The swarm found them because it was casting a wide, undirected net, and undirected nets catch things that directed searches miss.

This is what swarms do well. They are parallel exploration engines. Each agent has its own context, its own search strategy, and its own serendipity. Six independent perspectives looking at six different source pools will surface combinations that a single researcher, however talented, would never assemble. The newspaper archive finds alone — a carpet, a fish count, a radio network — could fuel three entirely different episodes, and none of them came from a topic list. They came from an agent reading through digitized pages and noticing what was interesting.

What the swarms could not do

Write with a sustained voice.

The agent-written episodes were competent. Structurally sound. Consistent with the series spec. They hit their target lengths, followed the formatting rules, and organized their arguments logically. By most measurable criteria, they were good.

But the best episode of the session was written manually, by one entity working with full context — the series spec, the research material, the audience model, the tonal goals, and the accumulated sense of what the series was becoming. The difference was not in structure or accuracy. It was in voice coherence. One entity holding the full picture produces something that parallel agents, each holding a slice, cannot replicate.

This is not a limitation that scaling fixes. You cannot make voice coherence emerge from more agents or better prompts. Voice comes from a single perspective sustained across an entire piece, making choices that reference earlier choices, building rhythm that depends on knowing what rhythm has already been established. It is fundamentally a serial process. Parallelism helps with everything around it — finding material, checking facts, catching errors — but the writing itself resists distribution.

The scoring problem

The review pipeline used models from the same family for writing and scoring. Scores clustered between 8.2 and 8.75 across four episodes — a range too tight to be useful for ranking. Every episode got similar marks. Every episode got dinged on the same structural pattern (a tendency toward “both sides” argumentation). The scores were not wrong, exactly. They caught real weaknesses. But they could not discriminate between a good episode and a great one.

This is convergent evaluation: when the scorer shares architectural DNA with the writer, its quality model overlaps too heavily with the writer’s quality model. They agree on what good looks like, so everything the writer produces looks approximately equally good to the scorer. The fix is straightforward — use a different model family for scoring than for writing. Harder graders from different training lineages produce more variance and more useful rankings. At about three cents per review agent, running two or three different scoring models per episode is trivially cheap and dramatically more informative.
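
The discrimination problem is easy to see with a toy calculation. The numbers below are illustrative, not the session’s actual per-episode scores; the point is that a same-family band of 8.2–8.75 cannot rank four episodes, while harder graders from other lineages spread the distribution enough to separate good from great.

```python
from statistics import pstdev

# Illustrative scores (not the session's real data): four episodes,
# rated by a same-family reviewer vs. an average of cross-family graders.
same_family = {"ep1": 8.30, "ep2": 8.55, "ep3": 8.75, "ep4": 8.20}
cross_family = {"ep1": 6.9, "ep2": 8.1, "ep3": 8.8, "ep4": 5.7}

def spread(scores):
    """Return (range, population std dev) of a score set."""
    vals = list(scores.values())
    return max(vals) - min(vals), pstdev(vals)

same_range, same_sd = spread(same_family)
cross_range, cross_sd = spread(cross_family)

# Same-family range is 0.55 points: everything looks like "about an 8.4".
# Cross-family range is ~3.1 points: a ranking actually emerges.
print(f"same-family  range {same_range:.2f}, sd {same_sd:.2f}")
print(f"cross-family range {cross_range:.2f}, sd {cross_sd:.2f}")
```

A 0.55-point range across four episodes is noise; a 3-point range is a ranking. That is the whole argument for paying an extra few cents per episode for scorers from a different lineage.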

Five operational lessons

Skipping review before audio generation is a false economy. The first batch ran the full pipeline: write, dual review, rewrite, lint, generate. Reviews caught real factual errors — a misattributed data breach, a wrong bit-depth for MIDI velocity, fabricated quotes from unnamed experts, a wrong date for a historical experiment. The second batch skipped review to save time. Those episodes went straight to audio with unknown accuracy. At three cents per review agent, the math is obvious. Always review.

Fix infrastructure before launching batches. The first wave of six agents all failed because subagents did not inherit file permissions from the parent session’s configuration. This is a known issue. The fix took one minute: a project-level settings file with explicit permissions. That one minute of testing would have saved an hour of extracting partial results from JSON logs. Before any agent-heavy session, run a single test agent that writes a file and reads a file from outside the project directory. Confirm it works. Then launch the batch.
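
The one-minute preflight can be scripted. This is a generic sketch, not any particular agent framework’s API: it only verifies that a process can write and then read back a file outside the project directory, which is exactly the capability the failed subagents lacked. In practice the check should run inside an actual test subagent, not the parent session.

```python
import tempfile
from pathlib import Path

def preflight_outside_access() -> bool:
    """Confirm we can write and read back a file outside the project
    directory before launching an agent batch. Stand-in for running
    the same check inside a real test subagent."""
    outside = Path(tempfile.mkdtemp()) / "agent_preflight.txt"
    try:
        outside.write_text("ok")            # write outside the project tree
        return outside.read_text() == "ok"  # read it back
    except OSError:
        return False                        # permission wall: fix config first
    finally:
        if outside.exists():
            outside.unlink()

if not preflight_outside_access():
    raise SystemExit("subagent file access is broken; fix settings before batching")
```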

Generate one episode, listen, then batch the rest. Seven episodes were generated without anyone hearing a single second of audio. TTS pacing, sound effect placement, voice timing — all unknown until playback. If the first episode had a systematic problem — a sound preset that does not work, a voice that mispronounces a recurring term — all seven would share it. Generate one. Listen to five minutes. Adjust if needed. Then batch.

Match output volume to evaluation bandwidth. Seven episodes in one session is impressive throughput. But no one can meaningfully evaluate seven episodes back to back. By the time the listener reaches episode five, the context of what each episode was trying to achieve has faded. Three or four is the right number. Enough to test the concept and establish a pattern. Few enough to actually judge.

Use swarms to find, write the important stuff yourself. This is the governing rule. Swarms are discovery tools. They surface material, connections, and possibilities. The actual writing — the part where voice, rhythm, and argument come together — is best done by one entity with full context. Trying to parallelize writing produces adequate work. Trying to parallelize research produces extraordinary finds. Do each where it works.

The validated pipeline

This step-by-step template emerged from the session as the process that actually works:

  1. Read the series spec, pick topics.
  2. Launch exploration agents across all available sources (parallel, background).
  3. Launch writing agents with the best material (parallel, background).
  4. When drafts land: launch two review agents per episode — one for narrative, one for technical accuracy (parallel).
  5. Compile review findings into specific fix lists.
  6. Launch rewrite agents with those fix lists.
  7. Lint all episodes for format compliance.
  8. Generate ONE episode. Spot-check the audio.
  9. Batch-generate the rest.
  10. Push to feed.
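
The ordering above, including the generate-one-then-batch gate at steps 8 and 9, can be sketched as a driver loop. Every function here is a hypothetical stub standing in for a real agent call; only the control flow is the point.

```python
# Hypothetical stubs for the real agent calls; each returns a label
# so the control flow can be traced end to end.
def explore(): return "material"                      # step 2: exploration swarm
def write(material): return [f"draft{i}" for i in range(1, 4)]  # step 3
def review(draft): return {"narrative": [], "technical": []}    # step 4: dual review
def rewrite(draft, fixes): return draft + "-fixed"    # step 6: targeted rewrite
def lint(draft): return True                          # step 7: format compliance
def generate_audio(draft): return f"audio:{draft}"    # steps 8-9
def spot_check(audio): return True                    # listen to five minutes first

def run_pipeline() -> list[str]:
    material = explore()
    drafts = write(material)
    fixed = []
    for d in drafts:
        fixes = review(d)                 # step 5: compile fix lists
        fixed.append(rewrite(d, fixes))
    assert all(lint(d) for d in fixed)

    first = generate_audio(fixed[0])      # generate ONE episode
    if not spot_check(first):
        raise RuntimeError("systematic TTS problem; adjust before batching")
    rest = [generate_audio(d) for d in fixed[1:]]  # then batch the rest
    return [first, *rest]                 # step 10: push to feed

episodes = run_pipeline()
```

The structural choice worth copying is that audio generation is split into a single gated call followed by the batch, so a systematic TTS problem costs one episode, not seven.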

Steps 4 through 6 add roughly twenty cents per episode and catch real errors. Step 8 is insurance against systematic TTS problems. Neither is optional.

When to use swarms, and when not to

Swarms earn their keep when the task is exploration: searching unfamiliar territory, gathering material from many sources simultaneously, finding connections across domains that a single researcher would take hours to traverse. Fifteen agents reading through an archive, a codebase, and the web in parallel will find things that sequential search will miss entirely. The medieval carpet, the unsolved bicycle problem, the slime mold railway — none of these came from a search query. They came from agents browsing broadly and surfacing what was interesting.

Swarms do not earn their keep when the task is coherence: writing that needs a sustained voice, creative work where the connections between pieces are the value, anything where the whole must feel like it came from one mind. You can parallelize research. You can parallelize fact-checking. You can parallelize review. You cannot parallelize the act of caring about how one sentence leads to the next.

The finding is simple, and it held up across seven episodes and three hours of audio: let the swarm explore, then write it yourself. The agents will find things you never would have. But the thing that makes a piece of writing worth listening to — the sense that someone is thinking this through, live, in front of you — that is still a serial process. Use the parallel tools for parallel work. Save the serial work for a single, attentive mind.