essay 2026-03-24

Five Good Decisions, One Bad System

Every API request takes 1.5 seconds. The dashboard loads in over eight seconds. A health-check endpoint — one that touches no database, runs no queries, returns a single JSON object — takes 1,490 milliseconds. Something is very wrong.

This is the story of how five individually reasonable architectural decisions compounded into a 500x performance regression. No single decision was a mistake. Every one passed review. Together, they created a system that punished every request with a 1.5-second tax that nobody noticed for five days.

If you build software with AI coding assistants — or frankly, with any process where decisions are made in isolated sessions — this failure mode is waiting for you.


Decision 1: A collaboration tool needs multi-user auth

The platform started simple. A single shared secret token authenticated all requests. Fast — microseconds per check. Dead simple — no database lookup required. One limitation: every user looked identical to the system.
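The original check can be sketched in a few lines (the secret value and function name are illustrative, not from the incident): one constant-time comparison, no database involved.

```python
import hmac

# Illustrative shared secret -- in the real system this was a deployed config value.
SHARED_SECRET = b"placeholder-secret"

def check_token(token: bytes) -> bool:
    # Constant-time comparison: cost is independent of where the inputs differ.
    # No database lookup, so this runs in microseconds.
    return hmac.compare_digest(token, SHARED_SECRET)
```

`hmac.compare_digest` is the standard way to compare secrets without leaking timing information, which is why this path is both fast and safe.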

Then a new collaboration tool joined the platform. It needed separate user identities. Two people sharing a task board need to know who marked what as done. The existing shared-secret pattern couldn’t support this.

The decision was obvious: add multi-user authentication. Create a users table, give each person their own token. This is not controversial. This is table stakes.

Verdict: Completely reasonable. You cannot run a multi-user tool on a single shared secret.


Decision 2: Hash tokens with bcrypt

With multi-user auth decided, the next question was how to store the tokens. The security-first instinct kicked in: “Hash tokens from day one. Don’t store plaintext secrets in the database.”

This is excellent security hygiene. The team chose bcrypt — the gold standard for password hashing. It’s battle-tested, available in every language, and deliberately slow to resist brute-force attacks.

The spec documented the implementation clearly: tokens would be stored as bcrypt hashes in the users table. Because bcrypt uses unique salts per hash, you can’t look up a token with a simple SQL WHERE token_hash = ? query. Instead, authentication requires loading all users and comparing the incoming token against each hash sequentially.

The spec noted this. It was acknowledged. And for the intended scale — three users in a household tool — it was genuinely fine.

Verdict: Reasonable, but wrong tool for the job. Bcrypt is designed for passwords: low-entropy, user-chosen strings that attackers will try to brute-force. Tokens are different. A 384-bit randomly generated token has no brute-force attack surface. SHA-256 provides the same security guarantees for tokens at a fraction of the cost. The deliberate slowness of bcrypt — 300 milliseconds per comparison — buys exactly zero additional security when the input already has astronomical entropy.

Nobody asked the question: “Why bcrypt instead of SHA-256?” The distinction between tokens and passwords never came up.


Decision 3: The reviewer flags the problem but deems it acceptable

The spec went through peer review. The reviewer correctly identified the issue and flagged it explicitly:

“The auth model says tokens are bcrypt-hashed… The actual pattern requires loading all users and comparing each.”

This was labeled as the most impactful issue in the review. The reviewer recommended documenting the pattern clearly so that future implementers wouldn’t try to use a SQL WHERE clause and wonder why it never matched.

Then came the assessment that buried the risk:

“With 3 users this is negligible.”

Technically correct. Three users at 300 milliseconds each means 900 milliseconds worst-case. For a household tool making occasional requests, that’s noticeable but survivable.

What the review didn’t say — and should have — was something like: “This is acceptable at this scale. However, this pattern is O(n) times 300ms and will not scale. If this pattern is ever generalized to other tools or user counts increase, it must be revisited.”

The risk was identified. It was not escalated. It was filed under “known and acceptable” with no guardrails on the conditions that made it acceptable.

Verdict: Locally correct, systemically dangerous. The review evaluated the pattern in its birth context — a three-user tool — and never considered what would happen if it escaped that context. Reviews that say “acceptable for now” without documenting what changes would make it unacceptable are setting traps for the future.


Decision 4: The pattern gets generalized to the entire platform

Two weeks later, the platform underwent an architectural consolidation. Five separate tools had been running their own authentication. The sensible move was to create a shared authentication layer — one module, one pattern, used everywhere.

The engineers lifted the collaboration tool’s bcrypt authentication into the shared library. All five tools now inherited the same auth flow: load all users from the database, loop through each one, run bcrypt comparison.

This is where the system failed.

A pattern that was “acceptable for 3 users in one tool” became the authentication layer for the entire platform. Nobody re-examined the assumptions from the original context. Nobody asked: “The original review said this was fine for 3 users — is it still fine for 5 users across 5 tools handling every request?”

Nobody ran the math.

Verdict: Architecturally sound, assumption-blind. The decision to create a shared auth layer was correct. The failure was not re-evaluating the inherited pattern’s assumptions when lifting it from local to global scope. A pattern acceptable at one scale doesn’t automatically remain acceptable at another.


Decision 5: Check the expensive path before the cheap path

The final piece: the ordering of authentication checks. The new shared auth module needed to support both the new per-user tokens (bcrypt-hashed in the database) and the legacy shared secret (a fast HMAC comparison). The question was which to check first.

The implementation checked the new bcrypt path first, with the legacy token as a fallback:

authenticate(request):
    user = bcrypt_loop(token)          # try new auth first: ~1.5 seconds
    if user: return user

    if hmac_match(token, secret):      # try legacy auth: microseconds
        return legacy_user

    return None                        # unauthenticated

The reasoning was forward-looking: “The new auth system is the future. Most tokens will eventually live in the users table. The legacy path is a temporary fallback that will be removed.”

This reasoning was correct about the future. It was catastrophically wrong about the present. At the time of deployment, 95% of all requests used the legacy shared secret. Every single one of those requests — the overwhelming majority of all traffic — had to first traverse the entire bcrypt loop, fail to match any user, and only then fall through to the instant HMAC check.

Every request paid a 1.5-second toll to check a path it would never match.

Verdict: Optimizing for tomorrow, breaking today. When cheap and expensive operations coexist, always run the cheap check first. The future state where “most tokens are in the database” hadn’t arrived yet. Designing for it as though it had already arrived meant every request in the present paid the price.


Five days of silence

Nobody noticed the regression for five days.

This is perhaps the most unsettling part of the story. A 500x performance regression — every request taking 1.5 seconds instead of 3 milliseconds — went undetected for nearly a week. How?

Three reasons. First, the platform served a small user base. This wasn’t a high-traffic SaaS product where latency spikes trigger pager alerts. It was a set of internal tools used by a handful of people. The load was light enough that nobody was staring at response times.

Second, the slowness was uniform. Every request was equally slow. There were no intermittent spikes, no timeouts, no error messages. The system worked — it just worked slowly. Users attributed the sluggishness to network conditions, server load, or the usual ambient friction of web applications. A consistent 1.5-second delay feels like “the internet being slow.” An intermittent 10-second spike feels like a bug.

Third, the regression was deployed alongside other changes. When everything changes at once, it’s hard to isolate which change caused which symptom. The auth refactor was bundled with feature work, and early testing happened in development environments where the effect was less pronounced.

The crisis surfaced only when someone ran systematic performance testing — the kind of focused measurement that makes 1,490-millisecond health checks impossible to explain away. That measurement should have been part of the deployment process. It wasn’t.


The math nobody did

Here is the calculation that would have prevented the entire incident. It takes thirty seconds:

  • Users in the database: 5
  • Bcrypt comparison cost: ~300ms per check
  • Worst case (no match): 5 × 300ms = 1,500ms
  • Percentage of requests using legacy tokens: 95%
  • Requests paying full bcrypt penalty: 95%

Result: 95% of all API requests will take at least 1.5 seconds.
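The same calculation, written as executable arithmetic with the numbers from the incident:

```python
users = 5                  # rows in the users table
bcrypt_ms = 300            # approximate cost per bcrypt comparison
worst_case_ms = users * bcrypt_ms        # full loop, no match

legacy_share = 0.95        # fraction of requests using the legacy secret
expected_penalty_ms = legacy_share * worst_case_ms

assert worst_case_ms == 1500   # every legacy request pays the full loop
```

Thirty seconds in a REPL, or on the back of an envelope, and the regression is visible before a single line of the refactor ships.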

This isn’t a subtle performance degradation. It’s not a rounding error. It’s a number so large it should have been immediately disqualifying. But the calculation was never performed. The cost model was implicit — “bcrypt is slow but necessary” — rather than explicit — “bcrypt will add exactly 1,500 milliseconds to 95% of requests.”

The math is damning because it’s elementary. No profiling tools required. No load testing infrastructure. No statistical analysis. Multiplication: five times three hundred equals fifteen hundred. That’s the entire calculation. Anyone who performed it — human or AI — would have immediately recognized the problem. Nobody performed it.

Worse still, the system included a cache — but the cache only stored positive results. When a token successfully matched a user, that match was cached for future requests. But the 95% of requests using legacy tokens would never match a bcrypt hash. The cache had a structural blind spot for the most common case.

The cache hit rate probably looked fine on paper. It was caching 100% of the cases that could match. It was also completely irrelevant to the 95% of requests that couldn’t. This is a subtler version of the same error: optimizing for the expected case while ignoring the actual case. The cache was designed for a world where most tokens would be in the users table. That world hadn’t arrived yet.
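The blind spot is easy to demonstrate with a sketch (all names illustrative): a cache that only remembers positive matches does nothing for a token that can never match, so the expensive scan repeats on every request.

```python
cache = {}

def cached_auth(token, users, verify):
    if token in cache:              # hits only for tokens that matched before
        return cache[token]
    for user in users:
        if verify(token, user):
            cache[token] = user     # positive results are remembered...
            return user
    return None                     # ...misses are recomputed every time

# Count how often the expensive comparison runs for a legacy-style token.
calls = {"n": 0}
def verify(token, user):
    calls["n"] += 1
    return token == user["token"]

users = [{"token": "a"}, {"token": "b"}]
cached_auth("legacy-secret", users, verify)   # full scan: 2 comparisons
cached_auth("legacy-secret", users, verify)   # full scan again: 2 more
assert calls["n"] == 4                        # the cache never helped
```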


The fix

The interim fix was almost embarrassingly simple. Two changes:

  1. Reorder the checks. Try the legacy HMAC comparison (microseconds) before the bcrypt loop (1.5 seconds).
  2. Cache negative results. When a token doesn’t match any bcrypt hash, remember that too.

The results:

Endpoint                  Before     After     Improvement
/api/me (health check)    1,490ms    3ms       497x
/api/stats                1,460ms    24ms      61x
/api/ideas                1,470ms    42ms      35x
Dashboard (browser)       8,400ms    ~200ms    42x

One collaborator, whose token lived in a secondary users table rather than the primary one, had been paying double — the system ran the bcrypt loop against the primary table (no match, 1,500ms), then against the secondary table (match found, 900ms). Total: 2,400 milliseconds per request. After the fix: roughly 1 millisecond.

The permanent fix — migrating from bcrypt to SHA-256 for token storage, enabling indexed O(1) database lookups instead of O(n) bcrypt loops — was designed the same day.
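The core of the permanent fix fits in a few lines (a dict stands in for an indexed database column; names are illustrative): because SHA-256 is deterministic, the hash itself becomes the lookup key.

```python
import hashlib
import secrets

def new_token() -> str:
    return secrets.token_urlsafe(48)    # 384 bits of randomness

def token_key(token: str) -> str:
    # Deterministic: the same token always yields the same key,
    # so the hash column can be indexed and queried directly.
    return hashlib.sha256(token.encode()).hexdigest()

users_by_hash = {}                      # stands in for WHERE token_hash = ?

token = new_token()
users_by_hash[token_key(token)] = {"name": "alice"}

# O(1) indexed lookup instead of an O(n) bcrypt loop:
assert users_by_hash.get(token_key(token))["name"] == "alice"
assert users_by_hash.get(token_key("wrong-token")) is None
```

The security argument from Decision 2's verdict applies here: a 384-bit random token has no brute-force surface, so the deliberate slowness of bcrypt buys nothing that a fast deterministic hash doesn't already provide.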

And the detection method that would have caught this before any user noticed? A single curl command with timing:

curl -w "%{time_starttransfer}s" -s -o /dev/null https://api.example.com/health

Thirty seconds of work. It was never run.


The AI angle

Here is what makes this case study different from a typical post-mortem: every one of the five decisions was made by an AI coding assistant. Specifically, they were made by the same model — across separate sessions.

Each session had its own context window. Each session was working on a well-defined task: design the spec, review the spec, implement the shared layer, wire up the auth checks. Each session did its job competently. The spec was well-documented. The review caught the right issues. The implementation was clean.

But no session had visibility into the full chain of decisions. The session that generalized the pattern didn’t re-read the review that said “acceptable for 3 users.” The session that ordered bcrypt-before-legacy didn’t know that 95% of traffic was legacy. The session that designed the cache didn’t model the miss rate for legacy tokens.

A forensic trace of the decisions revealed a consistent pattern across all five:

Strengths the AI exhibited:

  • Clear architectural documentation
  • Sound security reasoning (salts, hashing from day one)
  • Correctly identifying problems when asked to review

Weaknesses the AI exhibited:

  • No proactive cost modeling before shipping
  • No distinction between token and password hashing use cases
  • No “what if this pattern is copied?” risk analysis
  • Assumed future state would arrive before present state caused damage
  • No recommendation for post-deployment measurement

The core failure mode: locally optimal, globally sub-optimal.

Each session optimized for its own scope. The spec session optimized for security. The review session optimized for the collaboration tool’s scale. The generalization session optimized for architectural consistency. The ordering session optimized for the future state.

Every optimization was correct within its frame. No frame was wide enough to see the compound effect.

This is not a failure of the AI. It is a failure mode inherent to session-based development — whether the sessions are staffed by AI models, by different engineers on a team, or by the same engineer on different days. The problem is context boundaries. The solution is process.

What makes AI sessions especially susceptible is the hard boundary. A human engineer might remember — vaguely, imperfectly — that three weeks ago someone mentioned a performance concern about the auth layer. An AI session has no such memory. Each context window is a clean room. The spec review session’s warning about bcrypt scaling was written down in a review document, but the generalization session two weeks later never read that document. It didn’t know it existed.

The forensic trace revealed something else worth noting: the AI didn’t just miss the problem — it actively constructed reasonable justifications for each decision. “Check the new path first because it’s the future.” “Generalize to a shared layer for consistency.” These aren’t lazy decisions. They reflect genuine architectural thinking. The reasoning was sophisticated; it was the scope of reasoning that was too narrow.

This suggests that the risk of AI-assisted architecture isn’t bad decisions — it’s good decisions made without sufficient context. The quality of local reasoning can actually make the problem harder to spot, because each decision looks so defensible on its own that nobody thinks to question the aggregate.


The decision tree

Decision 1: The collaboration tool needs multi-user auth
    | YES (reasonable)
    v
Decision 2: Hash tokens with bcrypt
    | YES (security-first, but wrong algorithm for tokens)
    v
Decision 3: Accept O(n) bcrypt loop for 3 users
    | YES (reviewer flagged, deemed acceptable at scale)
    v
Decision 4: Generalize pattern to all platform tools
    | YES (shared layer is good architecture — assumptions not re-checked)
    v
Decision 5: Check bcrypt before legacy tokens
    | YES (optimizing for future state, not present reality)
    v
RESULT: 95% of requests pay a 1.5-second penalty

Five yes decisions. Each defensible. The system they created: indefensible.


The prevention checklist

For any future work on authentication, per-request middleware, or other infrastructure that runs on every API call:

  • Token vs. password: Is this protecting passwords (use bcrypt/argon2) or tokens (use SHA-256/HMAC)? Document why.
  • Cost model: Calculate worst-case latency: O(n) × cost-per-operation. Is the number acceptable at 10x the current user count?
  • Order of checks: Is the cheapest check first? If not, why not? Document the justification.
  • Caching strategy: Are negative results cached? What percentage of requests will miss the cache entirely?
  • Scale test: Run under realistic load before deploying to production.
  • Baseline measurement: Measure latency before and after any change to per-request infrastructure. A single timed curl request takes 30 seconds.
  • Generalization audit: If lifting a pattern from one tool to a shared layer, re-examine every assumption from the original context. What was “acceptable at this scale” that might not be acceptable at the new scale?
  • Risk documentation: If a review flags a risk and deems it acceptable, document the conditions under which it becomes unacceptable. Don’t let “acceptable for now” become “acceptable forever” by default.

Five principles from five failures

1. Fast path first. When cheap and expensive operations coexist, always check cheap before expensive. Don’t optimize for a future state at the expense of present performance. The future will arrive on its own schedule; the present is already here.

2. Cost models are load-bearing. O(n) × 300ms is not an abstraction. It is 1.5 seconds of real latency on real requests from real users. Calculate before shipping. “It’s probably fine” is not a cost model.

3. A token is not a password. Bcrypt’s deliberate slowness protects weak, human-chosen passwords from brute-force attacks. A randomly generated 384-bit token doesn’t need that protection. Using bcrypt on tokens is like installing a deadbolt on a bank vault door — it adds cost without adding security.

4. Generalization requires re-evaluation. A pattern that works at one scale, in one context, does not automatically work at every scale and context. When lifting a local pattern to a shared layer, re-examine every assumption. The question is not “does this work?” but “do the conditions that made this acceptable still hold?”

5. Negative results are load-bearing. The most common outcome in the system — “this token doesn’t match any bcrypt hash” — was the one outcome nobody optimized for. Caches, fast paths, and performance tuning must account for the cases that don’t match, not just the cases that do. In most systems, the miss is more common than the hit.


The broader lesson

There is a mental model in systems thinking called “the five whys.” You ask why something failed, and when you get the answer, you ask why again, five times, until you reach a root cause. This incident inverts that model. The root cause isn’t at the bottom of a chain — it’s the chain itself.

No single decision was the root cause. The root cause was that five decisions, each made with incomplete visibility into the others, created a compound effect that none of them individually predicted. The architecture was not designed — it emerged, one reasonable choice at a time, into something unreasonable.

This is the characteristic failure mode of session-based development. Whether those sessions are AI context windows, different engineers picking up tickets, or the same person returning to a codebase after a week — the failure mode is the same. Local optimization without global visibility. Decisions that are correct in isolation and catastrophic in combination.

The fix isn’t better AI models or smarter engineers. The fix is process: cost models before shipping, performance baselines after infrastructure changes, and explicit re-evaluation when local patterns become global ones.

Five good decisions made one bad system. The sixth decision — the one that would have prevented it — was never made at all. It was a thirty-second curl command that nobody thought to run.

The most expensive bugs are not the ones caused by ignorance. They’re the ones caused by competence applied too narrowly. Every decision in this chain was made by something that understood architecture, security, and software design. What none of them understood was the system they were collectively building.


This essay is based on a post-mortem investigation conducted using airplane-crash analysis methodology: trace decisions chronologically, identify compounding effects, and determine where the chain could have been broken. The performance data, decision timeline, and fix results are from the original incident.