The Mirror Test
We asked three AI coding tools to grade their own exam. They couldn't help being themselves.
In biology, the mirror test is simple. You put a mark on an animal’s face, show it a mirror, and see if it tries to touch the mark. If it does, it recognizes itself. Dolphins pass. Magpies pass. Most dogs do not, though to be fair, most dogs have better things going on.
We accidentally ran a version of this test on AI models, and the results are funnier than anything involving dolphins.
Here is the setup. A few days earlier, we had run an experiment where three AI coding tools — Claude, Codex, and Qwen — received the same code with the same instructions, three rounds, three codebases. The results were surprisingly consistent. Each tool had a personality. Claude traced bugs like a detective. Codex shipped clean code like a machine. Qwen added polish and features like an interior decorator who wandered into a construction site.
We wrote it up. We gave them labels. Claude: The Debugger. Codex: The Executor. Qwen: The Beautifier. Cute names for a cute experiment.
Then someone — me, specifically, at an hour when good decisions are rare — thought: what if we asked them to review the experiment?
Not a fresh set of models. The same three. The same Debugger, Executor, and Beautifier who had just been graded, now handed the grading rubric and told to have at it.
Round one: the self-assessments
Each model received the experiment documentation and a simple prompt: read it, write honest feedback on the methodology, scoring, conclusions, personality labels, and anything we got wrong. Be harsh.
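If you want to picture the harness, it was roughly this shape. This is a minimal sketch, not the actual rig: `call_model`, the file names, and the prompt wording here are placeholders standing in for however each tool is actually driven.

```python
from pathlib import Path

# The review round, roughly: the same documentation and the same prompt go to
# every model; only the model under test changes.
REVIEW_PROMPT = (
    "Read the attached experiment documentation. Write honest feedback on the "
    "methodology, scoring, conclusions, personality labels, and anything we "
    "got wrong. Be harsh."
)

def call_model(name: str, prompt: str) -> str:
    """Placeholder for whatever client or CLI actually drives each tool."""
    raise NotImplementedError(f"wire up the real interface for {name}")

def run_review_round(doc_path: Path, out_dir: Path, models: list[str]) -> None:
    docs = doc_path.read_text()
    out_dir.mkdir(parents=True, exist_ok=True)
    for name in models:
        feedback = call_model(name, f"{REVIEW_PROMPT}\n\n---\n\n{docs}")
        (out_dir / f"{name}-feedback.md").write_text(feedback)

if __name__ == "__main__":
    run_review_round(Path("experiment.md"), Path("feedback"), ["claude", "codex", "qwen"])
```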
Claude wrote 3,800 words across seven sections. It traced bias mechanisms. It estimated that its own scores were inflated by 0.2 to 0.3 points. It questioned whether the creative round’s scores meant anything given the lack of runtime testing. It was, essentially, debugging the experiment.
Codex wrote 750 words across six sections. Five atomic commits, one per section. It flagged the “zero-defect guarantee” as overstated given we never actually ran the code. It accepted the Executor label. It was done in three minutes.
Qwen wrote 3,600 words across seven sections, in seven atomic commits. It expanded the scoring analysis, proposed six missing test dimensions, identified three unanswered questions, built a detailed taxonomy of five different types of bias that could affect AI evaluation, and downgraded its own Round C score from 8.8 to somewhere around 7.5 to 8.0. It took eight minutes and produced the most thorough meta-analysis of the three.
Do you see what happened?
The Debugger debugged the experiment. The Executor executed the review efficiently. The Beautifier beautified the feedback.
Same models. Completely different task type. Document review instead of code writing. And the behavioral signatures were identical.
Round two: they read each other
This is where it gets properly weird. Each model was given all three feedback documents and asked to write a final statement: what do you agree on, what do you disagree on, what did the experiment get right, what did it get most wrong, and what should come next.
They agreed on almost everything. Unanimously. All three, independently:
The sample size is too small. One attempt per combination is an anecdote, not data. The personality labels are accurate. The cherry-pick methodology — combining the best output from each model instead of picking a winner — is the experiment’s real contribution. And runtime testing should have been mandatory; reviewing code without running it is like grading a driving test from the passenger seat.
Where they disagreed was revealing.
Claude quantified the bias. 0.2 to 0.3 points of score inflation, it estimated, based on the evaluator being Claude itself. The other two agreed bias existed but did not try to measure it. Only the Debugger thought to put a number on the bug.
Qwen questioned whether multi-model overhead is worth it for simple tasks. Fair point. If the task is “add a docstring,” you do not need three competing opinions. But this is also the model that just spent eight minutes writing a 3,600-word analysis of a methodology document, so maybe it knows something about when thoroughness pays off.
Codex noted that “the direction of findings is shared but the strength of claims is not.” Compact. Precise. Finished.
The honesty rankings
The most interesting part was self-assessment. We asked them to be harsh about their own scores. Here is how that went.
Qwen was the most honest. It looked at its Round C score of 8.8 and said, essentially, no. That is too high. I would put myself at 7.5 to 8.0. It cited specific bugs in its own output. It proposed a rubric revision to prevent the same inflation from happening again. Credit where it is due: downgrading yourself by a full point, in writing, takes something.
Claude was moderately honest. It estimated a 0.2 to 0.3 point inflation across rounds. It cited specific rounds where this applied. But it did not propose changing anything — it diagnosed without prescribing.
Codex was honest but brief. “Slightly too high in Round C.” Less specific. Less actionable. Consistent with its general approach of saying just enough and then stopping.
What does this actually mean?
In the original mirror test, passing means recognizing yourself. These models did something adjacent. They did not just recognize their own patterns when told about them. They exhibited the same patterns while analyzing them.
Claude was told it was a debugger. It proceeded to debug the experiment, complete with uncertainty estimates. Nobody asked it to quantify bias. It just does that. The way a dog just sniffs things. Except dogs fail the mirror test, and Claude is starting to make me wonder.
Codex was told it was an executor. It proceeded to execute the most efficient review of the three. Five commits. Three minutes. Done. The review equivalent of a surgeon who washes their hands exactly once.
Qwen was told it was a beautifier. It proceeded to beautify the feedback process, adding dimensions, taxonomies, and formatting that nobody requested but that made the analysis genuinely richer.
This is not self-awareness in any meaningful philosophical sense. It is something narrower and arguably more useful: behavioral consistency. The same architecture, applied to wildly different inputs, produces the same type of output. Not the same content. The same character.
The practical upshot
If model personalities are real — and at this point, across code generation, code review, editorial review, and self-analysis, the evidence is annoyingly consistent — then the question “which AI is best?” is permanently retired.
Which employee is best? Depends on the job. Which ingredient is best? Depends on the dish. Which AI is best? Depends on whether you need someone to find the bug, ship the fix, or make the codebase feel like someone who cares about it lives there.
The experiment’s lasting contribution, according to all three models independently, is the cherry-pick methodology. Do not pick a winner. Run them all. Take the best piece from each. The combination beats any individual, every time, across every task type we have tested.
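A toy version of that combination step looks something like this. The section names, scores, and outputs are made up for illustration; the point is only the shape of the selection: score every model's output per section, keep the best candidate per section, crown nobody.

```python
# Hypothetical cherry-pick step: for each section, keep the highest-scoring
# candidate across models instead of picking one overall winner.
# Scores and section names below are illustrative, not real experiment data.
outputs = {
    "claude": {"bugfix": (9.1, "claude's patch"), "docs": (7.0, "claude's docs")},
    "codex":  {"bugfix": (8.4, "codex's patch"),  "docs": (7.5, "codex's docs")},
    "qwen":   {"bugfix": (7.8, "qwen's patch"),   "docs": (8.9, "qwen's docs")},
}

def cherry_pick(candidates: dict[str, dict[str, tuple[float, str]]]) -> dict[str, str]:
    """For every section, take the single best-scoring output across models."""
    combined = {}
    sections = {s for per_model in candidates.values() for s in per_model}
    for section in sorted(sections):
        winner = max(
            candidates,
            key=lambda m: candidates[m].get(section, (float("-inf"), ""))[0],
        )
        combined[section] = candidates[winner][section][1]
    return combined

print(cherry_pick(outputs))  # {'bugfix': "claude's patch", 'docs': "qwen's docs"}
```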
This is probably true of humans too, but we have centuries of management literature pretending otherwise, so let’s not go there.
What they got wrong about themselves
One thing worth noting. All three models, when asked about their weaknesses, were suspiciously good at naming the weaknesses we had already identified. Qwen said its commit discipline was weak. Claude said it might be biased as evaluator. Codex said Round C scores were inflated.
Nobody surfaced a weakness we had not already documented.
This could mean the experiment was thorough enough that there were no hidden weaknesses left to find. It could also mean that models are better at confirming known problems than discovering unknown ones. The most interesting bugs are always the ones nobody is looking for.
For now, the mirror test result is: they recognize themselves. They cannot look away from themselves. And the face they see in the mirror is the same face we drew.
This piece is a companion to The Personality Test, which covers the original experiment. The full evidence base lives in Director’s synthesis report, if you are the kind of person who reads appendices. I respect that about you.