What AI actually can’t do in a real build

There’s a version of the AI development pitch that goes like this: hand the AI a problem, walk away, come back to working software. Sometimes the demo even looks like that. The reality of a production build is different, and it’s worth being specific about where.

We use AI heavily — it’s part of why our build cycles run three months instead of nine. But the model where AI does the work and the engineer reviews it is not the model we operate. The model we operate is one where the engineer holds the judgment and the AI does the typing. The distinction matters, because every time we’ve let it drift the other way, we’ve shipped a bug.

Here’s the list of things AI consistently can’t do, in our experience, on real client work.

1. Resolve an ambiguous spec correctly

The biggest source of subtle production bugs we’ve seen from AI-generated code isn’t bad code — it’s plausible code written against the wrong interpretation of the spec.

A request like “users should be able to delete their account” sounds unambiguous until you actually build it. Does delete mean soft-delete with a 30-day recovery window, or hard delete? Does it delete the team they own, or transfer ownership? Does it remove their messages from other users’ threads, or anonymize them? Does it count toward the GDPR “right to erasure” timeline?

A human engineer on a real build asks those four questions before writing the first line. An AI agent fills in the defaults that look most like its training data and produces code that compiles, passes its tests, and is silently wrong in a way you won’t notice until a real user does the thing.

The fix is structural: someone with context has to translate the ambiguous spec into a precise one before the AI sees it. That’s a senior judgment job, and it’s not going away.
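One way to make that translation concrete is to write the precise spec as a data structure with no defaults, so each of the four questions above has to be answered by a person before any implementation starts. The sketch below is illustrative only: `AccountDeletionSpec`, its fields, and the enum names are hypothetical, and the example values are not a recommendation.

```python
from dataclasses import dataclass
from enum import Enum


class DeleteMode(Enum):
    SOFT = "soft"  # recoverable for a grace period
    HARD = "hard"  # removed immediately


class OwnedTeams(Enum):
    DELETE = "delete"
    TRANSFER = "transfer"


class Messages(Enum):
    REMOVE = "remove"
    ANONYMIZE = "anonymize"


@dataclass(frozen=True)
class AccountDeletionSpec:
    # Hypothetical spec object. No field has a default on purpose:
    # every field is a decision someone with context must make.
    mode: DeleteMode
    recovery_days: int
    owned_teams: OwnedTeams
    messages: Messages
    counts_as_erasure_request: bool

    def __post_init__(self):
        # Reject combinations that contradict each other.
        if self.mode is DeleteMode.HARD and self.recovery_days != 0:
            raise ValueError("hard delete cannot have a recovery window")
        if self.mode is DeleteMode.SOFT and self.recovery_days <= 0:
            raise ValueError("soft delete needs a positive recovery window")


# Example values only; the real answers come from the product owner.
spec = AccountDeletionSpec(
    mode=DeleteMode.SOFT,
    recovery_days=30,
    owned_teams=OwnedTeams.TRANSFER,
    messages=Messages.ANONYMIZE,
    counts_as_erasure_request=True,
)
```

Leaving any field out raises a `TypeError` at construction time, which is the point: the ambiguity surfaces as a question for a human instead of a silent default.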

2. Make security trade-offs that depend on context outside the codebase

AI writes secure-looking code by default. It uses parameterized queries. It validates inputs. It hashes passwords. The problem is the security calls that depend on knowing things the code doesn’t tell you.

Example: an endpoint that returns a user’s profile. The AI-generated version checks that the requesting user is authenticated. It doesn’t check whether the requested profile belongs to the requesting user — because nothing in the code says it should. Whether that’s a bug or a feature depends on the product: a public-profile platform wants it that way, an internal CRM definitely doesn’t. The AI can’t tell.

Multiply that pattern by every endpoint in a system. AI-generated code is “secure” in the sense that it usually avoids the classic vulnerabilities. It’s not secure in the sense that matters here: nothing guarantees the authorization logic matches the real privacy boundaries of the product. That alignment is judgment work.
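The profile example can be sketched as a check where the product decision is an explicit input rather than something the code guesses. This is a minimal illustration, not a full authorization layer: `ProfileVisibility` and `can_view_profile` are hypothetical names, and a real system would likely have more than two visibility modes.

```python
from enum import Enum


class ProfileVisibility(Enum):
    # Hypothetical product-level setting, chosen by a person who knows the product.
    PUBLIC = "public"          # any authenticated user may view any profile
    OWNER_ONLY = "owner_only"  # users may view only their own profile


def can_view_profile(requester_id, profile_owner_id, visibility):
    """Return True if the requester may read the profile."""
    # Authentication: the part generated code usually gets right.
    if requester_id is None:
        return False
    # Authorization: depends on a product decision, not on the code.
    if visibility is ProfileVisibility.PUBLIC:
        return True
    if visibility is ProfileVisibility.OWNER_ONLY:
        return requester_id == profile_owner_id
    # Fail loudly on any mode this function was never told about.
    raise ValueError(f"unhandled visibility: {visibility!r}")
```

The same authenticated request gets a different answer under each setting, so the setting has to come from someone who knows whether this is the public-profile platform or the internal CRM.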

3. Invent novel architecture

AI is excellent at remixing and pattern-matching. Show it a codebase that uses a specific pattern, and it will produce more code in that pattern that fits cleanly. Give it a novel problem with no obvious precedent (a new kind of multi-tenant isolation, a new event-sourcing approach, a billing model nobody’s built before) and the output collapses into recognizable patterns that don’t quite fit.

This is the part of engineering work that’s most resistant to AI amplification. The leverage stays where the precedents are dense. Greenfield architecture work is still mostly human, and our discovery phase is structured around getting that work done before any AI gets near the codebase.

4. Hold a stakeholder map in its head

A surprising amount of senior engineering work is political, in the small-p sense. The marketing team wants a feature. The CS team is afraid of how customers will react. The CTO has a separate concern about how it’ll interact with the cache rewrite already in flight.

A human engineer holds all of those threads, often without writing any of them down, and produces a design that threads the needle. An AI agent, given the same spec, produces a design that satisfies the literal request and silently makes the cache problem worse, because nobody mentioned the cache in the prompt.

The fix is preparation. Before AI gets involved, a senior engineer writes down what the prompt has to know — the constraints, the in-flight projects, the politics. That document is itself a senior judgment artifact. Without it, the AI can’t help.

5. Make the regulated-industry judgment calls

We’ve shipped work in healthcare and fintech, and the pattern in both is the same: AI is great for the roughly 90% of the code that’s just code, and useless for the roughly 10% where compliance calls are being made.

“Should this log line include the user’s identifier?” depends on whether the log is going to a HIPAA-covered system, whether the user has consented to that processing, and whether your business associate agreement covers it. None of those facts are in the codebase. None of them are in the spec. They’re in your compliance posture, your legal advice, and the conversation you had with the auditor last week.

You can’t shortcut this with a more capable model. The information the model would need isn’t documented anywhere a model can read.
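What the code can do is record the outcome of a call once a person has made it. The sketch below assumes a hypothetical `LogSinkPolicy` that a compliance owner fills in for each log destination; the sink names and the `decided_by` values are invented for illustration. The code enforces the decision, it does not make it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogSinkPolicy:
    # Hypothetical record of a human compliance decision for one log sink.
    name: str
    may_receive_user_ids: bool
    decided_by: str  # who made the call, for the paper trail


def build_log_fields(event, user_id, sink):
    """Build log fields, including the user id only where the policy allows it."""
    fields = {"event": event, "sink": sink.name}
    fields["user_id"] = user_id if sink.may_receive_user_ids else "[redacted]"
    return fields


# Example policies; the values are assumptions, not recommendations.
audit_store = LogSinkPolicy("audit-store", True, "compliance review")
vendor_logs = LogSinkPolicy("vendor-logs", False, "compliance review")

allowed = build_log_fields("login", "u-42", audit_store)
redacted = build_log_fields("login", "u-42", vendor_logs)
```

With these example policies, `allowed["user_id"]` is `"u-42"` and `redacted["user_id"]` is `"[redacted]"`. Which sinks get `True` is exactly the part no model can fill in.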

What this means in practice

The pattern across all five: AI accelerates the work where the answer is already implicit in the existing code or the spec. It stalls — or worse, silently fails — where the work requires judgment, novel reasoning, or context that lives outside the codebase.

Our build pattern leans into this. Senior engineers do the discovery, the architecture, the scope, and the compliance work upfront. The AI takes the boilerplate, the test scaffolding, the documentation drafts, and the rote implementation of patterns the codebase already establishes. A senior reviews every output before it lands.

The reason we still bet the business on the AI half isn’t that AI is magic. It’s that the “rote work” is genuinely most of the keystrokes. When the senior is in the loop, removing the keystrokes is a real multiplier. When they’re not, you ship a bug that looks fine until production.

If you’re evaluating a firm that pitches AI development, the question to ask is structural, not technical: who is doing the judgment work, and how do you know? The answer should be a name, a process, and a paper trail.