Everyone’s using AI coding tools now. Your team is. My team is. The startup down the street that just raised a Series A on the strength of their “AI-native development process” definitely is. This isn’t controversial anymore - it’s just the state of things.

What’s less obvious is how enormous the gap is between “using AI” and “using AI well.” I’ve watched teams adopt Copilot, Claude, ChatGPT, and the rest, and the pattern is remarkably consistent: a burst of productivity, a growing pile of code that nobody fully understands, and a slow, quiet accumulation of debt. Not the dramatic kind - the subtle kind. Slightly inconsistent naming. Almost-correct error handling. Tests that pass but don’t actually test the thing that matters. Code that looks right, works today, and will confuse someone six months from now.

Here’s the position I’ve landed on after using AI extensively across a number of projects - personal sites, open-source libraries, mobile games, monorepos, and multi-service platforms spanning Go, .NET, iOS, and Android: the practices that prevent AI from degrading your codebase are the same practices that make AI dramatically more effective. They’re not a tax on productivity. They’re a multiplier. Good engineering discipline doesn’t fight AI - it gives AI something to be good at.

The rest of this post is the specific patterns I’ve found that work.

Treat AI output as untrusted input

The most useful mental model I’ve found: AI is a junior engineer with infinite energy, broad knowledge, and no judgment. It will confidently produce working code that subtly violates your conventions, misses edge cases, or introduces patterns you’d never approve in a human’s PR. It does this while being articulate, fast, and polite - which makes it more dangerous than a human who’d at least hesitate before committing something they weren’t sure about.

The fix is simple in principle: every piece of AI output goes through the same quality gates as human work. No shortcuts. No “it’s just a small change.” No “the AI wrote it and it looks fine.”

The review-fix loop

In all of my projects, I use a structured review-fix loop. AI generates a change, an automated review scores it against explicit criteria, the AI fixes the issues, and the loop repeats until a quality bar is met. The key insight is that this isn’t a human reviewing AI - it’s structured AI reviewing AI, with humans setting the criteria.

The loop works like this:

  1. Generate - AI produces a change (a new feature, a refactor, a config update)
  2. Review - The change is reviewed against domain-specific checklists, producing findings categorized as blocker, warning, or nit
  3. Fix - AI addresses the findings: blockers first, then warnings, and nits only if time permits
  4. Re-review - Back to step 2
  5. Converge - Stop when the quality bar is met or max rounds are exhausted

Two constraints make this actually work instead of spiraling. First, a target rating prevents shipping mediocre work - you set a minimum acceptable quality level, and the loop keeps going until the change reaches it. Second, a max rounds cap prevents infinite polishing - if three rounds haven’t fixed it, the loop stops anyway, because the remaining issues need human attention, not another AI pass.

# Review-fix loop parameters
max_rounds: 3        # Hard stop - prevents infinite iteration
min_rating: 3        # Minimum acceptable rating (out of 5 stars)
auto_fix: true       # AI attempts fixes between rounds
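
To make the control flow concrete, here is a minimal Python sketch of the loop under those two constraints. The names `Finding`, `review_fix_loop`, `review`, and `fix` are hypothetical stand-ins for whatever your tooling provides, and I'm assuming one "round" means one review pass followed by at most one fix pass.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Finding:
    severity: str  # "blocker", "warning", or "nit"
    message: str


# Hypothetical signatures: `review` scores a change from 1 to 5 and lists
# findings; `fix` returns a revised change given the ordered findings.
ReviewFn = Callable[[str], Tuple[int, List[Finding]]]
FixFn = Callable[[str, List[Finding]], str]

SEVERITY_ORDER = {"blocker": 0, "warning": 1, "nit": 2}


def review_fix_loop(
    change: str,
    review: ReviewFn,
    fix: FixFn,
    max_rounds: int = 3,
    min_rating: int = 3,
) -> Tuple[str, int, bool]:
    """Return (final_change, final_rating, converged)."""
    rating = 0
    for round_no in range(1, max_rounds + 1):
        rating, findings = review(change)
        if rating >= min_rating:
            return change, rating, True  # quality bar met, stop here
        if round_no < max_rounds:
            # Blockers first, then warnings, then nits.
            ordered = sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])
            change = fix(change, ordered)
    # Round cap hit without reaching the bar: hand off to a human.
    return change, rating, False
```

The `converged` flag is the important part of the return value: a `False` means the cap was hit and the remaining findings go to a person rather than back into the loop.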

The “vibe coding” anti-pattern

The most dangerous AI output is the kind that’s 95% correct. It passes a quick visual scan. It compiles. It might even pass existing tests. But it’s subtly wrong in a way you won’t notice until it causes a problem in production or - worse - until the next person who reads it builds on a flawed assumption.

This is what I call “vibe coding” - accepting AI output because it feels right without verifying it is right. It’s the AI equivalent of rubber-stamping a PR because the diff looks clean. Every experienced engineer knows not to do that with a human’s code. The same discipline applies here.

Give AI context, not freedom

Most people’s experience with AI coding tools is: open a file, type a comment, accept whatever comes out. That’s the lowest-value way to use these tools. It’s like hiring an engineer, not telling them anything about your project, and then being surprised when their code doesn’t match your conventions.

The better approach: give AI a structured context that narrows its decision space. The less freedom it has, the better its output. This sounds counterintuitive - isn’t the whole point of AI that it’s creative? - but it mirrors how good engineering teams work. You don’t tell a new hire “build whatever you want.” You tell them “here’s our architecture, here’s our style guide, here’s the problem to solve.”

Three tiers of instruction

I’ve landed on a three-tier instruction architecture that works well across projects of very different sizes.

Tier 1: Project-level instructions. Most AI assistants now look for a project-level instructions file in a known location - .github/copilot-instructions.md for GitHub Copilot, CLAUDE.md for Claude Code, and the emerging cross-vendor AGENTS.md convention. This file sets repo-wide policy - what the project is, how it’s built, what conventions to follow. Every AI interaction in the repo starts with this context. Here’s a trimmed example from a Go + TypeScript monorepo:

## Coding Standards

### Go
- **Format with `gofumpt`** (not plain `gofmt`). All Go code must be
  `gofumpt`-clean.
- **Lint**: `golangci-lint run` must pass. Config is `.golangci.yml`.
- **Errors**: Always check. Wrap with `fmt.Errorf("context: %w", err)`.
  Never discard.
- **Context**: Every I/O function takes `ctx context.Context` as first
  argument.
- **Logging**: `slog` (stdlib structured logger). No `fmt.Printf` in
  service code.
- **Tests**: Table-driven (`t.Run`). Pattern:
  `TestFunctionName_Scenario_Expected`.
- **Generated code**: Never edit `gen/` by hand. Run `buf generate` to
  regenerate.

### Protobuf
- **`proto/` is the contract layer.** All changes to service APIs start
  with a `.proto` change.
- **Additive only within v1.** New fields/RPCs/enum values are fine.
  Removing or renumbering is a breaking change requiring `v2/`.
- **`buf lint`** (STANDARD ruleset) and **`buf breaking`** both run on
  every PR touching `proto/`.

### Commit messages
- Subject line **and** detailed body. The body must explain what
  changed, why, and any non-obvious trade-offs.
- One-line commits are never acceptable - not even for docs or chores.
- A reviewer reading `git log` must understand *why* without opening
  the diff.

That’s not a lot of text, but it prevents an entire category of mistakes. The AI knows the formatter to use, the error-handling pattern, the logging library, the test naming convention, and - crucially - which files are off-limits because they’re generated. It knows the contract layer is proto, so it won’t try to add an endpoint by editing handler code first. It knows that one-line commits aren’t acceptable, so it won’t produce them. Without this, the AI confidently writes idiomatic-looking code that subtly violates every one of these conventions, and you spend your review time pointing out the same issues over and over.

Tier 2: Agent-level personas. Instead of one general-purpose AI that “does everything,” I use specialized agents with clear boundaries. A multi-service platform might have six:

| Agent             | Domain                   | Key responsibilities                                                    |
|-------------------|--------------------------|-------------------------------------------------------------------------|
| Backend Engineer  | Service implementation   | Go services, error handling, context propagation, structured logging    |
| API Architect     | Contracts and boundaries | Proto schema design, versioning, breaking change detection              |
| Frontend Engineer | Client surfaces          | TypeScript, React, state management, accessibility                      |
| DevOps Engineer   | Infrastructure           | Docker, K8s manifests, CI/CD pipelines, observability                   |
| Security Reviewer | Auth, secrets, data      | AuthN/AuthZ flows, secret handling, privacy compliance                  |
| Database Engineer | Data layer               | Schema migrations, query patterns, transaction boundaries               |

Each agent has a canonical spec defining its mission, operating style, behaviors to emulate, and explicit anti-patterns. The backend engineer agent knows that every I/O function takes a context as the first argument and that errors get wrapped with fmt.Errorf("op: %w", err). The API architect knows that proto changes within v1/ must be additive only - any field removal or renumbering requires a v2/. The security reviewer knows to always check authorization scope alongside authentication, because authenticated does not mean authorized.

The persona pattern deserves emphasis: a general-purpose AI that “does everything” produces worse output than a specialized agent operating within defined boundaries. This mirrors how real engineering teams work. You don’t ask your frontend engineer to review your Kubernetes manifests. You don’t ask your SRE to design your auth flow. Specialization produces quality.

This pattern scales in both directions. A personal blog or open-source library might only need three agents - writer, designer, infrastructure. The six-agent layout above is closer to a typical multi-service backend. Add game design, narrative design, or trust-and-safety roles for a consumer product, and you can justify eight or more. Specialize by domain, and each agent produces better output within its boundaries than a generalist would across all of them.

One pattern worth calling out specifically: contract-first governance. In a monorepo project, protocol buffer definitions are the source of truth - generated code is never edited by hand. The AI instructions encode this explicitly, preventing a whole class of mistakes where the AI “helpfully” modifies generated files. The rule is simple and absolute: if a file is generated, the AI touches the definition, not the output.

Tier 3: Task-level skills. Skills add workflow-specific gates on top of agent capabilities. A review-pr skill maps file patterns to specific checklists and routes them to the right specialist. Here’s a trimmed example of that mapping:

| File Pattern                          | Checklists       | Notes                          |
|---------------------------------------|------------------|--------------------------------|
| `*.go` (in `cmd/`, `internal/`, `pkg/`) | 03, 03a, 02, 04 | Backend service or systems code |
| `*.ts`, `*.tsx`                       | 03, 03b          | TypeScript and React code      |
| `*.proto`, `proto/**`, `gen/**`       | 02, 03           | Contract and boundary review   |
| `**/auth/**`, `**/security/**`, `.env*` | 05, 14         | Always apply security checklist |
| `**/payments/**`, `**/billing/**`     | 03, 04, 05, 07   | Payment flows: security + privacy |
| `Dockerfile*`, `k8s/**`, `helm/**`    | 10, 05           | Runtime environment + security |
| `.github/workflows/*`                 | 10, 05           | CI/CD security and runtime     |

The numbers point to specific checklists (03 = coding conventions, 05 = security, 10 = runtime environment, and so on). A change to a payments handler triggers four checklists. A change to a CI workflow triggers two. A hand edit to a generated file gets blocked outright, because nothing in gen/ should be hand-edited. The skill turns “review this code” into “evaluate this code against these specific criteria” - which is far more useful direction for an AI.
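
The routing itself is simple enough to sketch in a few lines of Python. This is an illustration, not the skill's actual implementation: `RULES` and `route` are hypothetical names, the patterns are the table's globs rewritten for `fnmatch` (whose `*` already crosses `/`, so `**` becomes `*`), and taking the union of all matching rows is my assumption about how overlaps combine. The table lists `gen/**` under contract review while the text says hand edits there are blocked; the sketch follows the text.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# (patterns, checklist IDs), transcribed from the table above.
RULES = [
    (("cmd/*.go", "internal/*.go", "pkg/*.go"), ("03", "03a", "02", "04")),
    (("*.ts", "*.tsx"), ("03", "03b")),
    (("*.proto", "proto/*"), ("02", "03")),
    (("*/auth/*", "*/security/*", ".env*"), ("05", "14")),
    (("*/payments/*", "*/billing/*"), ("03", "04", "05", "07")),
    (("Dockerfile*", "k8s/*", "helm/*"), ("10", "05")),
    ((".github/workflows/*",), ("10", "05")),
]


def route(path: str) -> Tuple[bool, List[str]]:
    """Return (blocked, checklist_ids) for one changed file path."""
    if path.startswith("gen/"):
        # Generated code: block the edit instead of reviewing it.
        return True, []
    ids = set()
    for patterns, checklists in RULES:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            ids.update(checklists)  # assumption: overlapping rows are unioned
    return False, sorted(ids)
```

For example, `route(".github/workflows/ci.yml")` yields the two checklists from the last row, while any path under `gen/` comes back blocked with no checklists at all.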

Make quality explicit and measurable

“Quality” is meaningless unless you define it. If you can’t tell an AI what “good” looks like in concrete terms, you can’t expect it to produce good work. This is equally true for human engineers, but the consequences are more acute with AI because it will confidently produce something regardless of whether it understands your quality bar.

Checklists beat vibes

The review skill maps every file type to a specific checklist. Here’s what Go service code gets checked against:

## Go Service Code Checklist
- [ ] Code is `gofumpt`-clean and passes `golangci-lint run`
- [ ] All errors are checked - none discarded with `_`
- [ ] Errors are wrapped with context: `fmt.Errorf("op: %w", err)`
- [ ] Every I/O function takes `ctx context.Context` as first argument
- [ ] No `fmt.Printf` in service code - structured logging via `slog`
- [ ] Tests are table-driven with `t.Run` subtests
- [ ] No edits to files under `gen/` (generated code)
- [ ] Public APIs have godoc comments
- [ ] No naked goroutines - every `go` statement has a clear lifecycle

Infrastructure changes get a different checklist:

## Infrastructure Checklist
- [ ] No secrets in source (API keys, tokens, passwords)
- [ ] Docker image uses specific tag (not :latest in production)
- [ ] Health check endpoints configured
- [ ] Resource limits set in K8s manifests
- [ ] Workflow permissions follow least-privilege

None of this is subjective. “All errors are checked” is a binary check - either every error is handled or it isn’t. “No secrets in source” is a binary check. “No edits to files under gen/” is a binary check. When quality criteria are explicit, AI can actually evaluate them reliably. When quality criteria are vague - “make sure the code is good” - you get vague results.
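
To show what "binary" means in practice, here is a rough Python sketch of three of those checks as yes/no functions over a Go source string. These are text-level heuristics I'm using for illustration only (they will misfire on comments and string literals); real enforcement belongs to linters such as the `errcheck` linter in `golangci-lint`. All function names are hypothetical.

```python
import re
from typing import Dict

# A blank identifier on the left of `=` or `:=`, e.g. `n, _ := f()` or `_ = f()`.
# It deliberately does not match `for _, v := range xs`.
_DISCARD = re.compile(r"(?<!\w)_\s*:?=(?!=)")


def no_fmt_printf(go_source: str) -> bool:
    """True when the source never calls fmt.Printf."""
    return "fmt.Printf(" not in go_source


def no_discarded_results(go_source: str) -> bool:
    """True when nothing is assigned to the blank identifier."""
    return _DISCARD.search(go_source) is None


def not_generated(path: str) -> bool:
    """True when the file lives outside gen/."""
    return not path.startswith("gen/")


def run_checks(path: str, go_source: str) -> Dict[str, bool]:
    """Evaluate each checklist item to a plain pass/fail."""
    return {
        "no fmt.Printf in service code": no_fmt_printf(go_source),
        "no errors discarded with _": no_discarded_results(go_source),
        "no edits under gen/": not_generated(path),
    }
```

Each item resolves to `True` or `False` with nothing in between, which is what lets a reviewer, human or AI, report it without interpretation.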

Rating rubrics remove ambiguity

The checklists feed into a rating rubric that defines five levels:

| Rating     | Meaning    | Criteria                                                                           |
|------------|------------|------------------------------------------------------------------------------------|
| ⭐⭐⭐⭐⭐ | Exemplary  | No issues. Clean, well-structured, follows all conventions.                        |
| ⭐⭐⭐⭐   | Good       | Minor nits only. No functional or quality issues.                                  |
| ⭐⭐⭐     | Acceptable | Some warnings but no blockers. Mergeable with minor fixes.                         |
| ⭐⭐       | Needs Work | One or more warnings that should be addressed before merge.                        |
| ⭐         | Blocker    | Critical issues (security exposure, broken builds, generated code edited). Must fix. |

This removes the “LGTM” problem. Instead of a binary pass/fail, the review produces a rating that maps to a specific quality level. The review-fix loop has a minimum target rating. If the change doesn’t hit it, the loop continues. If it does, work stops. No ambiguity about what “good enough” means.
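
One way to turn findings into a rating is a small function like the Python sketch below. The rubric doesn't state a numeric boundary between "Acceptable" and "Needs Work", so the `warning_budget` parameter is purely my assumption for illustration, as is the function name `rate`.

```python
from collections import Counter
from typing import Iterable


def rate(severities: Iterable[str], warning_budget: int = 2) -> int:
    """Map finding severities ("blocker", "warning", "nit") to 1-5 stars.

    `warning_budget` is an assumed cut-off: up to that many warnings still
    counts as 3 stars, more than that drops the rating to 2 stars.
    """
    counts = Counter(severities)
    if counts["blocker"] > 0:
        return 1  # Blocker: must fix
    if counts["warning"] > warning_budget:
        return 2  # Needs Work
    if counts["warning"] > 0:
        return 3  # Acceptable
    if counts["nit"] > 0:
        return 4  # Good: nits only
    return 5      # Exemplary: no findings
```

Whatever thresholds you pick, the point is that they are written down once and applied identically on every round.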

The anti-pattern here is the same one that plagues human code review: rubber-stamping. If you wouldn’t approve a human’s PR with a two-second glance, don’t do it with AI-generated code. The difference is that with AI, you can automate the rigor instead of relying on human discipline at review time.

Build convergence loops, not one-shots

Single-pass generation produces mediocre output. This is true for human writing too - first drafts are rarely publishable - but it’s especially true for AI because AI has no internal quality bar. It generates something that satisfies the prompt and moves on. The magic is in iteration.

How convergence actually works

The review-fix loop I described earlier isn’t just a quality gate - it’s a convergence mechanism. Each round narrows the gap between what was produced and what’s actually good:

  • Round 1: Gets the structure right. The broad strokes are correct but the details are sloppy - errors are checked but not wrapped, log statements use fmt.Printf instead of slog, a function signature is missing ctx context.Context.
  • Round 2: Catches the specific issues. Blockers and warnings from the first review get fixed. The output is meaningfully better.
  • Round 3: Polishes. Nits get addressed. The output converges on what you’d actually ship.

Most changes converge in 2-3 rounds. That’s been consistent across both code and infrastructure changes.

Convene a council, not a single reviewer

The review step doesn’t have to be one reviewer. The same persona agents from Tier 2 can be convened as a council - each one reviews the same change through its own lens, and their findings flow into the same blocker/warning/nit pile. A recent Go service change that I ran through a council is a good example:

  • The backend engineer flagged that the new handler kicked off a goroutine without honoring ctx.Done() - a leak waiting to happen under load.
  • The security reviewer flagged the same handler for accepting an authenticated request without checking that the caller’s scope actually covered the resource being mutated.
  • The DevOps engineer flagged that the new endpoint wasn’t covered by the readiness probe, so a half-initialized pod would happily start serving it.

No single reviewer would have caught all three. The loop only converges when no council member has remaining blockers, which raises the quality bar without piling more cognitive load on any one reviewer - human or AI. Each persona stays inside its lane, and the union of their lanes is the actual surface area of the change.
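
Structurally, a council is just a fan-out over reviewers plus a merge. The Python sketch below assumes each persona is a callable returning `(severity, message)` pairs; the two toy reviewers are string matchers standing in for real model calls, and every name here is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Reviewer = Callable[[str], List[Tuple[str, str]]]  # -> [(severity, message)]
Finding = Tuple[str, str, str]                     # (persona, severity, message)


def convene(change: str, council: Dict[str, Reviewer]) -> List[Finding]:
    """Run every persona over the same change and pool the findings."""
    findings: List[Finding] = []
    for persona, reviewer in council.items():
        for severity, message in reviewer(change):
            findings.append((persona, severity, message))
    return findings


def has_converged(findings: List[Finding]) -> bool:
    """The loop may stop only when no council member still has a blocker."""
    return not any(severity == "blocker" for _, severity, _ in findings)


# Toy stand-ins for real persona agents.
def backend(change: str) -> List[Tuple[str, str]]:
    if "go func" in change and "ctx.Done()" not in change:
        return [("blocker", "goroutine does not honor ctx.Done()")]
    return []


def security(change: str) -> List[Tuple[str, str]]:
    if "authenticated" in change and "scope" not in change:
        return [("blocker", "no authorization scope check")]
    return []
```

Because each finding keeps its persona label, the fix step can see which lens raised which issue, and adding a reviewer is one more entry in the dictionary.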

Front-loading thinking with structured pipelines

One pattern that’s dramatically improved first-pass quality: don’t let AI jump straight to code. In one project, I use a specification-first pipeline - analyze, specify, plan, task, implement, verify - where the AI goes through structured planning phases before writing a single line. The implementation step comes last, after the problem is well-defined and the approach is agreed upon. This front-loads the thinking and produces better output on the first pass, which means fewer rounds in the review-fix loop.

A related pattern: some projects auto-generate development guidelines from feature plans. The AI instructions aren’t static files you write once and forget - they evolve with the project. When a new feature area gets planned, the guidelines update to reflect new conventions, new boundaries, new domain knowledge. The AI’s context stays current because the system keeps it current.

Scope discipline prevents thrashing

The key constraint that makes iteration work instead of spiraling: fix only what was flagged, don’t refactor the world. Without this rule, the AI will “helpfully” improve things that weren’t in the review findings, which introduces new issues, which get flagged in the next review, which triggers more “helpful” improvements, and now you’re three rounds deep in changes nobody asked for.

The fix rules are explicit:

  • Fix only issues identified in the review
  • Preserve the author’s intent - don’t rewrite working logic that wasn’t flagged
  • For code: fix correctness and conventions (errors, context, logging) - do NOT restructure or “improve” untouched code
  • For infrastructure: prioritize security (secrets, permissions, network policies) over style

Severity prioritization matters too. Blockers get fixed first. Warnings second. Nits are optional. This prevents the AI from spending a round perfecting a comma placement while a secrets exposure goes unaddressed.
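
Both rules are easy to enforce mechanically. Here is a small Python sketch, assuming findings are `(severity, path, message)` tuples and that scope is checked at file granularity; both of those are simplifications of mine, and `fix_queue` and `out_of_scope` are hypothetical names.

```python
from typing import Iterable, List, Tuple

Finding = Tuple[str, str, str]  # (severity, path, message)

SEVERITY_ORDER = {"blocker": 0, "warning": 1, "nit": 2}


def fix_queue(findings: List[Finding], include_nits: bool = False) -> List[Finding]:
    """Order work for the fix pass: blockers, then warnings, nits only if asked."""
    kept = [f for f in findings if include_nits or f[0] != "nit"]
    # sorted() is stable, so findings of equal severity keep their review order.
    return sorted(kept, key=lambda f: SEVERITY_ORDER[f[0]])


def out_of_scope(findings: List[Finding], touched_paths: Iterable[str]) -> List[str]:
    """Files the fix pass modified even though no finding pointed at them."""
    flagged = {path for _, path, _ in findings}
    return sorted(set(touched_paths) - flagged)
```

A non-empty result from `out_of_scope` is the signal that the AI has started "helpfully" improving things nobody flagged, and the fix can be rejected before it reaches the next review round.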

Multi-model review reduces blind spots

In higher-stakes projects - multi-service platforms with auth, tenancy boundaries, and real user data - I use multi-model review. Different AI models review the same change. Each model has its own blind spots and biases. One might miss a security issue that another catches. One might flag a style inconsistency that another considers fine. The consensus across models is more reliable than any single model’s opinion.

This sounds like overkill, and for a personal blog it is. But for a platform with real security boundaries? The cost of a missed security issue vastly exceeds the cost of an extra review pass.

Multi-model review and the council pattern are orthogonal - a council of personas can each run on multiple models, multiplying the diversity at each round.

Hard stops exist for a reason

Not everything is negotiable. Some guardrails are absolute, and no amount of AI confidence should override them.

What AI must never do

These rules are non-negotiable across all my projects:

  • AI must never merge its own PRs. A human approves the final result. Always.
  • AI must never push directly to main. Branch protection isn’t optional - and in some projects this is encoded directly in the AI instructions, not just the repo settings. AI works on branches, creates PRs, and waits for human approval.
  • Security issues are never dismissible. Exposed secrets, missing auth, unsafe permissions - these can’t be explained away regardless of how many review rounds have passed.
  • Broken functionality is never dismissible. Failing tests, invalid YAML, broken builds - if it’s broken, it’s blocked.
  • Generated code is never edited by hand. If a file is in gen/ or marked auto-generated, the AI changes the source definition (proto, schema, OpenAPI spec), not the output.
  • Contract changes are never silent. Changes to public APIs, proto definitions, or database schemas require explicit acknowledgment in the PR description and a design doc reference if the change is breaking.

The strictest projects go further. The review target isn’t 3/5 - it’s 5/5 from every reviewing agent, every PR, no exceptions. When the blast radius of a bad merge is “real users lose money” or “auth boundaries break,” the review bar goes up accordingly.

Dismissal rules make the boundary explicit

The review-fix loop allows some findings to be dismissed rather than fixed - but only certain categories. An intentional style deviation? Dismissible, if documented. A pre-existing issue not introduced by this PR? Dismissible, tracked separately. A security issue? Never dismissible. A broken template? Never dismissible.

## When NOT to Dismiss

Never dismiss:
1. Security issues - exposed secrets, missing auth, unsafe permissions
2. Broken functionality - failing tests, invalid YAML, broken builds
3. Generated code edits - never modify files under `gen/` by hand
4. Breaking contract changes without a design doc reference
5. Player or user data handling without privacy review
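
Encoded as a guard, the rule looks something like this Python sketch. The category identifiers are hypothetical labels I've made up for the five items above, and requiring a written justification for every other dismissal is my reading of "dismissible, if documented".

```python
# Hypothetical category labels for the five never-dismiss rules above.
NEVER_DISMISS = frozenset({
    "security",
    "broken_functionality",
    "generated_code_edit",
    "breaking_contract_without_design_doc",
    "user_data_without_privacy_review",
})


def can_dismiss(category: str, justification: str = "") -> bool:
    """Return True only for dismissible categories with a documented reason."""
    if category in NEVER_DISMISS:
        return False  # no justification can override a hard stop
    return bool(justification.strip())
```

Note that the hard-stop check comes first, so no amount of persuasive justification text can talk the guard out of a security finding.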

The principle is simple: AI operates within boundaries set by humans. The boundaries are non-negotiable. The work within those boundaries is where AI adds value. An AI that can do anything is an AI you can’t trust. An AI that does excellent work within well-defined constraints is genuinely useful.

CI/CD is the final gate

The build pipeline doesn’t care whether code was written by a human or an AI. Health checks still need to pass. The container still needs to build. The manifests still need to be valid. If the health check fails, the deploy fails. AI-generated code doesn’t get special treatment.

This is the safety net that catches everything else. If the review-fix loop misses something, if a checklist has a gap, if a human rubber-stamps a PR they shouldn’t have - the pipeline is the last line of defense. It’s the same pipeline you’d have without AI. That’s the point.

This scales from solo to team

I want to address the objection I know some readers are forming: “This sounds like a lot of overhead for writing code.” It’s not. These patterns scale down to a solo developer and up to a team. The overhead is proportional to the stakes.

The spectrum

Rather than a binary solo-vs-team split, I’ve found these practices work across a spectrum. The architecture is the same everywhere - the parameters change. Those parameters: how many agent personas, how strict the review target, how broad the checklists, and whether multi-model review is in play.

Solo and personal projects - an open-source library, a side project, a personal blog - use lightweight configurations. Three agent personas, a handful of skills, a 5/5 review target, and a CI pipeline that validates every change. The total setup cost is a few markdown files and a workflow. The payoff: consistent style across every change, automated quality checks that catch obvious mistakes before they ship, and a deploy process that validates health endpoints after every rollout. I don’t have to remember to check whether errors are wrapped or whether tests are table-driven - the checklist handles it.

Mid-size projects - mobile games, monorepos - need more structure. Five or six agent personas covering game design, platform engineering, and domain-specific concerns. Contract-first governance where protocol definitions are the source of truth and generated code is hands-off. Specification-first pipelines that front-load the thinking before implementation starts. The review bar holds at 5/5 with stricter dismissal rules, the checklists are broader, and the AI instructions evolve alongside the project as new feature areas emerge.

Platform projects - multi-service, multi-tenant systems - use the full toolbox. Eight or more specialized personas spanning everything from trust and safety to narrative design to platform product design. Multi-model consensus review catches blind spots that any single model would miss. The review target is 5/5 from every reviewing agent, every PR, no exceptions. Merge policies are explicit - AI agents can create PRs and push to feature branches, but only humans merge to main. Governance for AI tooling and Model Context Protocol (MCP) integrations ensures the system’s own tooling gets the same scrutiny as application code.

The project-level instructions file is the single highest-leverage thing I’ve set up across any of these. One markdown file that tells AI how the project works, and the quality of every interaction improves. If you take nothing else from this post, take that.

Start small

If you’re starting from zero, here’s the progression I’d recommend:

  1. Add a project instructions file. Drop one where your AI assistant looks for it - .github/copilot-instructions.md for Copilot or CLAUDE.md for Claude Code. Put your project’s conventions in it - language, frameworks, style rules, deployment model. This takes about 30 minutes and immediately improves AI output quality.
  2. Add one review checklist. Define what “good” looks like for the most common type of change in your repo. Even five items on a checklist gives the AI something concrete to evaluate against.
  3. Add agent personas when you notice drift. If AI-generated code starts mixing styles, or infrastructure changes start missing security basics, that’s when a specialized agent pays for itself.
  4. Add convergence loops when quality matters. For anything that ships to users - production code, deployment configs, public APIs - the review-fix loop catches what single-pass generation misses.

You don’t need all of this on day one. You need step 1 on day one. The rest follows as you find the need.

The punchline

Here’s what I keep coming back to: these aren’t AI-specific practices. They’re engineering practices - code review, checklists, deployment validation, separation of concerns, principle of least privilege - adapted for a new kind of contributor. The teams that struggle with AI are usually teams that were already struggling with engineering discipline. AI just accelerates whatever was already happening, for better or worse.

The best AI workflow is one where removing the AI wouldn’t break your process - it would just slow it down. If your system falls apart without AI, you’ve built a dependency. If it just gets faster with AI, you’ve built a tool. Every practice I’ve described in this post works without AI. The instructions file is useful for human onboarding. The review checklists are useful for human PRs. The convergence pattern is just “revise your work before shipping.” AI makes all of this faster. It doesn’t make any of it optional.

The irony is real: the teams that will get the most out of AI are the ones that already had good engineering discipline. AI doesn’t replace good practices. It rewards them.

If you want a practical starting point: add a project instructions file to your repo tomorrow - .github/copilot-instructions.md or CLAUDE.md, whichever fits your tooling. Put your project’s conventions in it - the architecture, the style rules, the things you find yourself repeating in PR reviews. See how the AI output changes. Iterate from there. That’s the whole approach in miniature - explicit quality criteria, structured context, and convergence through iteration. The rest is just tuning the parameters.