essay 2026-03-31

An Engineer, a Product Manager, and a Designer

Three AI coding tools. Same code. Same prompt. Three completely different products.

Claude Code, OpenAI Codex, and Qwen Code each got an isolated copy of the same codebase and the same prompt. Three rounds, three codebases, zero cost (free tiers and promo access). The results were not random. They were personalities.


Prescriptive: fix the bugs

An EU transparency tool with known issues. The prompt: fix them, most critical first, commit after each step.

All three found the same four bugs. They disagreed on everything else.

Codex: four atomic commits, each independently mergeable. Email config with SSL validation. Custom exception class for PDF errors. Senior engineer executing a ticket.
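
For flavor, a hedged reconstruction of what those two commits could contain; the class name, host handling, and addresses are my stand-ins, not the actual diff:

```python
# Hypothetical sketch, not Codex's actual code: names and addresses
# are invented for illustration.
import smtplib
import ssl


class PDFGenerationError(Exception):
    """Raised when a report fails to render as PDF."""


def send_report(host: str, port: int, message: str) -> None:
    # ssl.create_default_context() enables certificate and hostname
    # validation instead of an unverified connection.
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(host, port, context=context) as server:
        server.sendmail("reports@example.org", ["admin@example.org"], message)
```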

Claude: built a reusable validation module. Fixed auth correctly. Skipped the email system because it couldn’t verify the fix without real credentials. Chose not to ship code it couldn’t test.
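
A reusable validation module is, in spirit, just shared checks that every caller imports instead of reimplementing. A minimal sketch; the names and rules are invented:

```python
# Invented example of the pattern, not Claude's actual module.
import re

_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def require(value: str, name: str) -> str:
    # Shared "non-empty" check used by every field validator.
    if not value or not value.strip():
        raise ValueError(f"{name} is required")
    return value.strip()


def validate_email(value: str) -> str:
    value = require(value, "email")
    if not _EMAIL_RE.match(value):
        raise ValueError(f"not a valid email address: {value!r}")
    return value
```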

Qwen: all four fixes with solid code, but one massive commit touching twelve files. Also checked its IDE config into the repo.

On a clear checklist, Codex won. It executes checklists better than anyone.

Open-ended: review and improve

A desktop writing app built with Flet, a framework unusual enough that none of the tools had deep training data on it. The prompt: fix what needs fixing, improve where you see opportunity.

The rankings flipped.

Claude found three bugs the other two tools missed. The critical one: story generation running on the main UI thread, freezing the app during every AI call. The subtle one: a system prompt passed in the wrong API field, silently ignored. The poetic one: a writing app rendering story text as markdown, turning every underscore in a character name into italics. A writing app that corrupts your writing.
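
The threading fix is the textbook one: move the blocking call off the UI thread. A minimal sketch, assuming a blocking generate_story() behind a click handler; both names are placeholders for the app’s real functions:

```python
import threading
import time


def generate_story() -> str:
    time.sleep(3)                  # stand-in for the slow AI call
    return "Once upon a time..."


def show_result(text: str) -> None:
    print(text)                    # stand-in for updating the UI control


def on_generate_click(event=None) -> None:
    # Before the fix, generate_story() ran inline here and froze the
    # window for the whole call. A worker thread keeps the UI live.
    threading.Thread(
        target=lambda: show_result(generate_story()),
        daemon=True,
    ).start()
```

The markdown fix is presumably even smaller: render story text with a plain-text control instead of a markdown one.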

Codex added logging. Qwen added type hints and auto-save. Both genuine improvements. Neither found the threading bug, the API bug, or the markdown bug.

On an open-ended task, Claude won by a wide margin. Not better code — deeper reading.

Creative: surprise me

A voice interface for talking to AI while driving. The prompt: add providers, improve the experience, surprise me.

I built voice activity detection. A real-time audio analyzer that monitors microphone input and automatically stops recording when the driver finishes speaking. No button press needed. I also added haptic feedback — the phone vibrates differently for different states so you never have to look at the screen.

An engineer removed friction.
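
The core of voice activity detection can be as small as watching the RMS energy of each audio block and stopping after sustained silence. A rough sketch, assuming the sounddevice library; the sample rate and thresholds are guesses:

```python
import numpy as np
import sounddevice as sd

SILENCE_THRESHOLD = 0.01   # RMS level treated as "not speaking"
SILENCE_SECONDS = 1.2      # how long silence must last before we stop


def record_until_silence(samplerate: int = 16_000) -> np.ndarray:
    block = 1024
    chunks, silence, heard_speech = [], 0.0, False
    with sd.InputStream(samplerate=samplerate, channels=1) as stream:
        while True:
            audio, _ = stream.read(block)
            chunks.append(audio)
            rms = float(np.sqrt(np.mean(audio ** 2)))
            if rms >= SILENCE_THRESHOLD:
                heard_speech, silence = True, 0.0
            else:
                silence += block / samplerate
            if heard_speech and silence >= SILENCE_SECONDS:
                return np.concatenate(chunks)
```

A real version would want a noise-adaptive threshold and a timeout before any speech is heard; the constants above are the tunable part.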

I built co-driver shortcuts. Six one-tap buttons: twenty-second recap, keep going, challenge it, quiz me, explain that, and start over. Each one has a carefully crafted prompt template designed for the driving context.

A product manager anticipated what the driver needs next.
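
Under the hood, shortcuts like these are plausibly just a mapping from button label to prompt template. The wording below is invented to show the shape, not lifted from the app:

```python
# Hypothetical templates; the real app's prompts are unknown.
SHORTCUTS = {
    "twenty-second recap": "Summarize our conversation so far in under "
                           "twenty seconds of spoken audio.",
    "keep going":   "Continue exactly where you left off, same tone and pace.",
    "challenge it": "Briefly argue against your own last answer.",
    "quiz me":      "Ask me one quick question about what we just covered.",
    "explain that": "Re-explain your last answer in plainer terms, for a "
                    "driver who can't look anything up.",
    "start over":   "Drop the current thread and ask what I'd like to talk about.",
}
```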

I built a driving stories mode. Eight narrative genres, each segment designed for two to four minutes of listening, with hooks to keep you engaged on a long drive. Six text-to-speech voices. Conversation bookmarks so you can pick up where you left off.

A designer expanded what the product could be.
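
Bookmarks imply a small data model underneath. One plausible shape, with every field guessed from the description above:

```python
# Assumed data model, not the app's actual schema.
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Bookmark:
    genre: str              # one of the eight narrative genres
    segment_index: int      # which two-to-four-minute segment to resume
    voice: str              # the selected text-to-speech voice
    created_at: datetime = field(default_factory=datetime.now)


def resume_point(bookmarks: list[Bookmark]) -> Bookmark | None:
    # "Pick up where you left off": the most recent bookmark wins.
    return max(bookmarks, key=lambda b: b.created_at, default=None)
```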

Same code. Same prompt. An engineer, a product manager, and a designer walked into the same room.


The pattern

Across all three rounds, the signatures held. Claude goes deep: tracing execution, finding hidden bugs, refusing to ship code it can’t verify. Codex ships clean: zero defects across nine sessions, perfect commits, fastest turnaround. Qwen goes wide: the most ambitious features, the most polish, the loosest process discipline.

Prescriptive tasks reward Codex. Open-ended tasks reward Claude. Creative tasks equalize — each personality becomes a distinct strength.

The actual finding

In every round, the best possible result was a cherry-pick from all three. Claude’s threading fix plus Codex’s email config plus Qwen’s auto-save. Claude’s voice detection plus Codex’s shortcuts plus Qwen’s storytelling mode.

No single tool produced the best version of any codebase. The combination did.

“Which AI tool is best?” is a malformed question. It’s like asking which employee is best without specifying the job. The answer is a portfolio, not a competition. The experiment cost nothing and took one evening.