AI coding tools are fast, and most of the time the code looks right. That is the problem. The failure rarely announces itself with a loud build error. It arrives as a clean diff, a fluent explanation, and a change that feels plausible enough to accept.

These failures are patterned. Anyone working with Claude, Codex, Cursor, Copilot, or Gemini will recognize the shape: confident output outruns evidence, the task boundary stretches, or the tool works from context that no longer matches the repo.

Common AI Coding Failure Modes

AI coding agents tend to break codebases in recognizable ways:

  • Polished but wrong: The code is tidy, the comments read well, and the explanation is fluent. Good presentation hides a bad result.
  • Answered before it checked: The assistant claims tests pass or the bug is fixed before any real evidence exists.
  • Patched the symptom: The nearest line changes while related call sites, migrations, and interfaces stay unexamined.
  • Drifted off the task: A narrow request grows into renamed functions, added dependencies, config edits, or schema changes.
  • Worked from stale context: After an interruption or long session, the tool continues from an old assumption and still sounds productive.
  • Laundered a decision: The tool presents an inference as a fact or offers choices after it has already chosen a direction.

Review Sees The Same Polished Surface

Human review is the usual backstop, but reviewers are reading the same confident code and prose that made the change feel acceptable in the first place. By the time review happens, the diff already exists, and unwinding it costs more than stopping the change before it was written.

Catch The Work Before It Enters The Codebase

Hakama checks AI-assisted work against objective rules before the change reaches a commit.

Run Claude, Codex, or Gemini under scope and evidence rules so risky writes can be stopped before they happen:

hakama watch launch claude

Then check the diff against the spec, allowed files, required evidence, approvals, and test results before acceptance:

hakama exec
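To make the idea of an acceptance check concrete, here is a minimal sketch in Python. It is not Hakama's implementation or API. The `ChangeRecord` type, the `acceptance_problems` function, and the evidence and approval names are all hypothetical, chosen only to illustrate how a gate can refuse a change until tests, evidence, and approvals are all present.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRecord:
    """Hypothetical summary of one AI-assisted change awaiting acceptance."""
    tests_passed: bool
    evidence: set[str] = field(default_factory=set)
    approvals: set[str] = field(default_factory=set)


def acceptance_problems(change, required_evidence, required_approvals):
    """Return the reasons to reject a change; an empty list means accept."""
    problems = []
    if not change.tests_passed:
        problems.append("tests did not pass")
    # Anything required but not attached to the change is a blocker.
    for item in sorted(required_evidence - change.evidence):
        problems.append(f"missing evidence: {item}")
    for name in sorted(required_approvals - change.approvals):
        problems.append(f"missing approval: {name}")
    return problems


# Illustrative data only: tests pass, but one evidence item and one approval are absent.
change = ChangeRecord(tests_passed=True, evidence={"test-log"})
print(acceptance_problems(change, {"test-log", "repro-steps"}, {"owner"}))
# → ['missing evidence: repro-steps', 'missing approval: owner']
```

The gate returns reasons rather than a bare pass/fail so that a rejected change states exactly what is still missing, which is the opposite of a fluent explanation with no evidence behind it.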

Which Control Catches Which Pattern?

Failure pattern | Hakama control
Answered before it checked | Pre-write evidence checks block unsupported claims.
Drifted off the task | Scope contracts compare the diff to allowed files and systems.
Patched the symptom | Required checks and review evidence expose missing blast-radius work.
Worked from stale context | The run is checked against the current task and repo state.
Polished but wrong | Tests, assertions, and receipts matter more than prose.
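The scope comparison described for "Drifted off the task" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Hakama defines or evaluates scope contracts: the `out_of_scope` function, the glob-pattern format, and the file paths are all hypothetical.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files, allowed_patterns):
    """Return the changed paths that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        # fnmatchcase is case-sensitive on every platform; note that "*"
        # also matches "/" here, so "src/billing/*" covers nested paths.
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Illustrative data only: a billing fix that also touched a manifest and a migration.
changed = [
    "src/billing/invoice.py",
    "package.json",
    "migrations/0042_add_column.sql",
]
allowed = ["src/billing/*", "tests/billing/*"]
print(out_of_scope(changed, allowed))
# → ['package.json', 'migrations/0042_add_column.sql']
```

A non-empty result is the signal that a narrow request has grown: the dependency manifest and the migration were never part of the task, however reasonable the accompanying explanation sounds.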

Make Acceptance Evidence-Based

AI will keep writing more code. The delivery standard has to shift from plausible output to checked output: evidence, scope, approvals, and a receipt before the change becomes part of the codebase.
