Engineering · AI Development · Consulting Model

How we shipped a five-phase production audit in three days

Traditional pre-launch hardening is a six-week project for a team of five. We did it in three days with two cofounders and a small army of AI agents. Here’s how the pattern actually works — and where it doesn’t.

Kashan Ali · Cofounder · Forward Deployed Engineer

5 min read · May 12, 2026

If you’ve ever hired a consulting firm to do a “production-readiness audit” on a software product, you know the script. A team of five (project manager, two engineers, a designer, a security reviewer) shows up, spends a week shadowing, six weeks writing reports, and another two weeks implementing the fixes. Total bill: $60–$120k. Total elapsed time: about three months.

We just did the same kind of audit, on our own product, in three days, with two cofounders. The whole thing landed across 26 commits, all reviewed, all tested, all in production. This isn’t a brag — it’s a workflow we want to lay out, because we think it’s genuinely how this kind of work is going to be done in 2026 and beyond.

Here’s how it actually goes.

Step 1: A discovery pass that scores the system, not the team

Before writing a single fix, we spawn five to eight read-only research agents in parallel — each scoped to one dimension. One looks at security. One looks at observability. One looks at compliance. One looks at performance. One looks at operational readiness. They each return a tight written report (under 1,000 words) with file paths, line numbers, and concrete examples of what’s broken.
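To make the fan-out concrete, here’s a minimal sketch of what the discovery pass looks like, assuming a hypothetical run_research_agent wrapper around whatever agent runner you use — the dimension briefs below are illustrative, not our actual prompts.

```python
# Sketch of the parallel discovery pass. `run_research_agent` is a placeholder
# for your agent runner (CLI wrapper, API client, etc.) -- it is hypothetical.
import asyncio

DIMENSIONS = {
    "security": "Audit authz/authn, injection risks, secrets handling.",
    "observability": "Audit logging, metrics, tracing, alerting.",
    "compliance": "Audit data retention, export, and deletion paths.",
    "performance": "Audit slow queries, N+1s, unbounded payloads.",
    "operations": "Audit deploys, backups, runbooks, on-call readiness.",
}

async def run_research_agent(dimension: str, brief: str) -> str:
    """Placeholder: dispatch one read-only agent and return its written report."""
    prompt = (
        f"Read-only audit of the {dimension} dimension of this codebase. {brief} "
        "Return a report under 1,000 words with file paths, line numbers, "
        "and concrete examples of what is broken."
    )
    # In practice: await your agent client here, with read-only tool access.
    return f"[{dimension} report would come back here]\nPrompt used: {prompt}"

async def discovery_pass() -> dict[str, str]:
    # Every dimension runs concurrently; each returns one tight report.
    reports = await asyncio.gather(
        *(run_research_agent(d, b) for d, b in DIMENSIONS.items())
    )
    return dict(zip(DIMENSIONS, reports))

if __name__ == "__main__":
    asyncio.run(discovery_pass())
```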

This is the part that’s genuinely different from a human audit. A human team has to time-share — they can audit security this week, performance next week. Agents can do all of it concurrently. We get the whole picture by the end of the morning.

The output is a list. Ours had 46 items.

Step 2: A plan, not a sprint

We don’t open a ticket for each item and start working. We sort the 46 into three buckets:

  • Critical (blocks confident customer onboarding). About 6–10 items.
  • Important (hardens before the first paying client). About 10–15.
  • Polish (improves trust, raises the floor on UX). The rest.

Each bucket becomes a phase. Phases ship independently — the critical fixes don’t wait on the polish work to be ready.
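The triage itself doesn’t need anything fancier than a severity label mapped onto a phase. A rough sketch, with illustrative field names rather than how we actually track findings:

```python
# Sketch of the triage step: each audit finding gets a severity, and each
# severity maps to a phase that ships on its own schedule.
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"    # blocks confident customer onboarding
    IMPORTANT = "important"  # hardens before the first paying client
    POLISH = "polish"        # improves trust, raises the floor on UX

@dataclass
class Finding:
    title: str
    file_path: str
    severity: Severity

def plan_phases(findings: list[Finding]) -> dict[Severity, list[Finding]]:
    phases: dict[Severity, list[Finding]] = {s: [] for s in Severity}
    for finding in findings:
        phases[finding.severity].append(finding)
    return phases
```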

This sounds obvious but it’s the part most teams skip. Without phasing, the “fix everything” mandate either bogs down the team or gets cherry-picked into whatever feels easiest. With phasing, every item has a deadline and a sequence.

Step 3: Worktree-isolated implementation

Now the work. For each phase, we spawn one implementation agent per logical concern (between three and six per phase) — each in its own isolated copy of the codebase. They all start from the same base commit. They each have a focused scope: “fix the IDOR in the projects API,” “wire frontend Sentry,” “add the GDPR data export endpoint.” They each have to write tests. They each have to run the verification before committing.
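The isolation is plain git worktrees: one checkout per concern, all cut from the same base commit. A minimal sketch, assuming a local repo and illustrative branch names:

```python
# Sketch of the worktree setup. Each agent gets its own branch and its own
# working copy of the repo, so parallel edits can't clobber each other.
import subprocess
from pathlib import Path

def make_worktrees(repo: Path, base_commit: str, concerns: list[str]) -> list[Path]:
    worktrees = []
    for concern in concerns:
        branch = f"agent/{concern}"
        path = repo.parent / f"wt-{concern}"
        subprocess.run(
            ["git", "-C", str(repo), "worktree", "add",
             "-b", branch, str(path), base_commit],
            check=True,
        )
        worktrees.append(path)
    return worktrees

# Example (hypothetical concern names):
# make_worktrees(Path.home() / "product", "main",
#                ["idor-projects-api", "frontend-sentry", "gdpr-export"])
```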

Because each agent is in its own copy of the codebase, they can’t accidentally step on each other. We’ve had two agents touch the same file maybe a dozen times across this pattern, and the merge conflicts have been small and obvious — five lines, both wanting to add to the same enum, easy to resolve.

The tradeoff: when an agent makes a mistake, you don’t catch it until cherry-pick time. That’s usually fine, because each agent’s output is small and reviewable. But you do need someone walking through the diffs before merging. That someone is one of us — a human, with the agent’s report sitting next to the diff for context.

Step 4: Verification that doesn’t depend on the agent saying it’s done

The most important rule we’ve learned: never trust an agent that says “tests pass.” Always re-run them yourself. Specifically:

  • Run the full test suite, not just the new tests.
  • Run the type checker.
  • Run the linter, with the strictest gate the project has.
  • Build the production bundle.

If any of those break, the cherry-pick gets rejected and the agent gets sent back with the failure log. We do this every time. It’s annoying and it’s how you avoid the “I shipped a bug because the AI said it was fine” story.
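A minimal version of that gate, assuming a pytest backend and an npm-based frontend — swap in whatever test runner, type checker, linter, and build step your project actually uses:

```python
# Sketch of the verification gate we run ourselves before accepting a cherry-pick.
import subprocess
import sys

CHECKS = [
    ["pytest", "-q"],                               # full test suite, not just the new tests
    ["npx", "tsc", "--noEmit"],                     # type checker
    ["npx", "eslint", ".", "--max-warnings", "0"],  # strictest lint gate available
    ["npm", "run", "build"],                        # production bundle
]

def verify() -> bool:
    for cmd in CHECKS:
        if subprocess.run(cmd).returncode != 0:
            # Reject the cherry-pick and send the failure log back to the agent.
            print(f"FAILED: {' '.join(cmd)}", file=sys.stderr)
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if verify() else 1)
```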

What it looks like, end-to-end

For our recent audit, here’s how the numbers came out:

  • Discovery: 8 read-only research agents, ran in parallel, took about 30 minutes wall-clock.
  • Phase A (critical, 6 items): 5 implementation agents in parallel, about 90 minutes wall-clock, 410 → 446 backend tests passing.
  • Phase B (hardening, 5 items): 5 agents, about 90 minutes wall-clock.
  • Phase C (polish, 6 items): 6 agents, about 2 hours wall-clock.
  • Phase D (cleanup, 3 items): 3 agents, about an hour.
  • Phase E (final small fixes, 4 items): 4 agents, about an hour.

Plus reviews, plus debugging two agents that made mistakes, plus opening and merging the PR. Total wall-clock: about three days, with both cofounders working through it. Test count went from 410 to 532. Frontend went from zero tests to twelve. ESLint went from 50 problems to zero.

What this isn’t

This isn’t “AI replaces engineers.” The orchestration — knowing what to audit, how to scope each agent, when to push back, how to triage the diffs — is the actual work. Without that, you get a flood of code you can’t verify and don’t fully understand. We spent more time reviewing diffs than writing prompts.

It also doesn’t generalize to every problem. Greenfield work where the agent has to invent a new architecture from a fuzzy spec is still where this pattern struggles most. Audits, refactors, additive features in well-defined parts of an existing codebase — that’s the sweet spot.

If you’re sitting on a product that needs a hardening pass, the math is genuinely different than it was eighteen months ago. Three days, two people, a verifiable list. Worth knowing what’s on the other side of it.
