Code Review in the Age of AI-Generated Code

AI coding assistants have changed how code is written. They have not changed the need for review. If anything, they have made it more important – and more difficult.


The output of a single developer has changed dramatically. With AI coding assistants – Copilot, Cursor, Claude Code, Codex CLI – a developer can produce in hours what used to take days. Functions are generated. Boilerplate is eliminated. Entire modules materialise from a prompt.

This is a genuine productivity gain. More code is written, faster. But code that is written must also be reviewed, tested, deployed, and maintained. And while AI has accelerated the writing, it has not accelerated the reviewing. The bottleneck has shifted.

In teams that adopted AI coding assistants, the volume of code produced per developer per sprint has increased. But the number of experienced reviewers has not. The review queue is longer. The PRs are larger. And the code itself has characteristics that make it harder to review, not easier.


The review bottleneck

Before AI coding assistants, the constraint in most teams was writing speed. Developers could review roughly as fast as they could write, because both activities required the same skills and operated at similar speeds. A team of five developers, each writing and reviewing code, maintained a rough equilibrium.

AI assistants have broken that equilibrium. A developer with an AI assistant writes code two to five times faster than before. But reviewing code has not become two to five times faster. Reviewing still requires a human to read the code, understand what it does, verify that it does the right thing, check for edge cases, and assess architectural fit. None of these steps has been automated.

The result is a growing imbalance. More code enters the review queue than exits it. PRs wait longer. Reviewers are pressured to review faster, which means they review less carefully. The quality bar drops not because anyone decided to lower it, but because the maths no longer works.

Some teams respond by reducing review requirements – allowing self-approval for smaller changes, reducing the number of required reviewers, or skipping review entirely for AI-generated code on the assumption that the AI got it right. This is understandable. It is also dangerous.


The specific patterns of AI-generated code

AI-generated code is not the same as human-written code. It has distinct characteristics that affect how it should be reviewed.

Syntactically correct but semantically wrong. AI models are excellent at producing code that compiles, passes type checks, and looks correct at a glance. The syntax is always clean. The formatting is always consistent. The variable names are always reasonable. This surface-level correctness makes errors harder to spot. A human reviewer scanning AI-generated code is likely to trust it more than they should because it looks professional and deliberate. But the AI may have misunderstood the requirement. The function may return the wrong result for edge cases the AI did not consider. The algorithm may be correct in theory but wrong for the specific domain constraints of the project.
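A hypothetical illustration of the pattern: this function compiles, type-checks, and reads cleanly, yet fails a plausible domain requirement on the boundary case. The function name and the requirement it violates are invented for the example.

```python
from datetime import date

def days_remaining(expiry: date, today: date) -> int:
    """Days of paid service remaining, where the expiry day itself is still valid."""
    # Clean syntax, sensible names, correct types -- but the subtraction
    # excludes the expiry day, so a plan expiring today reports 0 days
    # when the stated requirement implies 1. Nothing about the code's
    # surface signals the error; only checking the edge case against
    # the requirement does.
    return (expiry - today).days
```

A reviewer scanning for style finds nothing to comment on here; the bug only appears when the reviewer asks what the spec says should happen on the expiry day itself.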

Confident-looking but possibly hallucinated. AI coding assistants do not express uncertainty. They produce code with the same confidence whether they are implementing a well-documented pattern or inventing an API that does not exist. A generated function call to database.upsertWithConflictResolution() looks plausible, reads well, and may not exist in the library being used. The AI does not flag this as uncertain. The reviewer must verify it.

This is a fundamental asymmetry. Human developers signal uncertainty through comments, questions in PRs, and tentative phrasing. AI-generated code is always confident. The reviewer cannot use the code's tone as a signal for where to focus attention.
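Some of that verification can be mechanised. As a sketch, a reviewer or a pre-review script can check that a generated call actually exists on the object it is invoked on; the `Database` class here is a stand-in for a real third-party client, not any particular library.

```python
class Database:
    """Stand-in for a third-party client whose real API has only upsert()."""
    def upsert(self, row):
        pass

db = Database()

# Check each generated method name against the actual object before
# trusting it. A hallucinated name fails this check immediately.
for name in ("upsert", "upsertWithConflictResolution"):
    exists = callable(getattr(db, name, None))
    status = "exists" if exists else "NOT FOUND - verify against the docs"
    print(f"{name}: {status}")
```

This catches only the crudest hallucinations (nonexistent names); a call that exists but behaves differently from what the AI assumed still needs human verification against the library's documentation.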

Locally correct but globally inconsistent. AI coding assistants work within a context window. They see the current file, perhaps a few related files, and the prompt. They do not see the full codebase. They do not know that the project already has a date formatting utility in a shared module. They do not know that error handling follows a specific pattern established two years ago. They do not know that the team decided last month to deprecate a particular library in favour of a replacement.

As a result, AI-generated code is often correct in isolation but inconsistent with the rest of the project. It introduces a new pattern where an existing one should have been used. It adds a dependency that the team is trying to remove. It implements a solution that works but contradicts the architectural direction.

These inconsistencies are the hardest problems to catch in review because each individual piece of code is defensible on its own. The problem is only visible when you consider the codebase as a whole.


Why human review of AI code is more important, not less

There is a tempting logic that goes like this: the AI is smart, the code compiles, the tests pass, so the code is probably fine. This logic is wrong for two reasons.

First, the AI does not understand your system. It understands code in general, not your code in particular. It does not know your business rules, your performance constraints, your security requirements, or your architectural principles. A human reviewer does – or at least has access to the context needed to evaluate these things.

Second, the characteristics of AI-generated code – surface-level correctness, uniform confidence, local optimality – are precisely the characteristics that evade superficial review. If you were going to reduce review rigour for any category of code, AI-generated code would be the worst choice. It is the code that most benefits from careful, context-aware human scrutiny.

The appropriate response to AI-generated code is not less review but different review. Reviewers need to focus less on syntax and formatting (the AI handles these well) and more on semantics, architecture, and consistency. Does this code do what the requirement specifies? Does it handle the edge cases that matter for this domain? Does it fit with the rest of the codebase? Is it using the right patterns and the right dependencies?


The scale problem

Asking reviewers to be more thorough while the volume of code increases is not a sustainable solution. The maths does not work. If each developer produces three times more code and each review needs to be more careful, the review capacity gap widens until something breaks.

The only viable solution is to augment human review capacity with automated tools that can handle the scale. This does not mean replacing human reviewers. It means giving them a first pass that handles the mechanical checks – consistency with existing patterns, dependency correctness, error handling completeness, architectural alignment – so that human reviewers can focus their limited time on the judgement-intensive work that only humans can do.

This is particularly important for teams that have adopted AI coding assistants aggressively. If a significant portion of your codebase was written or significantly modified by AI, you need to audit it at the codebase level, not just at the PR level. Individual PRs may have been reviewed, but the cumulative effect of hundreds of AI-generated changes – the introduced inconsistencies, the duplicated patterns, the hallucinated dependencies – is only visible when you look at the whole.


What a codebase-level review catches

PR-level review catches local problems. Codebase-level review catches systemic ones. In codebases with significant AI-generated content, the systemic problems are often the most important.

Pattern inconsistency. The AI generated three different approaches to pagination across six endpoints. Each approach works, but the inconsistency means the front-end team must handle three different pagination contracts.

Dependency sprawl. The AI added four different HTTP client libraries across different modules because each prompt happened to produce code using a different library. The project now has redundant dependencies that increase bundle size, complicate updates, and create confusion about which library is canonical.

Architectural drift. Over six months of AI-assisted development, the codebase has gradually shifted from the intended architecture. Business logic has crept into controller layers. Data access patterns are inconsistent. The service boundaries are blurred. No individual PR violated the architecture, but the cumulative effect is significant.

Security blind spots. The AI implemented authentication correctly in the main user flow but missed it in an admin endpoint it generated three months later. The AI did not know the admin endpoint existed when it generated the user flow, and it did not know about the user flow when it generated the admin endpoint. Neither PR showed a problem. The codebase-level view reveals that authentication coverage is inconsistent.
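A codebase-level check for this class of gap can be sketched as follows, assuming a convention where every handler must carry a `require_auth` decorator; both the decorator name and the handler names are hypothetical.

```python
import ast

SOURCE = '''
@require_auth
def get_profile(request): ...

def admin_delete_user(request): ...
'''

def routes_missing_auth(source: str) -> list[str]:
    """List top-level handlers that lack a require_auth decorator."""
    tree = ast.parse(source)
    missing = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            names = {d.id for d in node.decorator_list if isinstance(d, ast.Name)}
            if "require_auth" not in names:
                missing.append(node.name)
    return missing

# Flags the admin endpoint generated months after the user flow.
print(routes_missing_auth(SOURCE))
```

Because the check runs over the whole codebase rather than a single diff, it sees both endpoints at once, which is exactly the view neither of the original PRs had.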


The path forward

AI coding assistants are here to stay. They make developers more productive. The code they produce is often good. But the review process has not kept pace with the production rate, and the specific characteristics of AI-generated code create new categories of risk that demand new approaches to review.

Teams that adopted AI coding assistants without adjusting their review processes will accumulate technical debt faster than teams that did not use AI at all – not because the AI writes bad code, but because more code with the same review capacity means less review per line.

VibeRails provides codebase-level AI code review that complements PR-level review. It analyses the entire codebase as a coherent system, identifying the consistency gaps, architectural drift, and pattern conflicts that PR-level review cannot see. For teams working with AI-generated codebases, it provides the scale of review that the scale of production demands.


Limits and tradeoffs

  • Automated review can miss context. Treat findings as prompts for investigation, not verdicts.
  • False positives happen. Plan a quick triage pass before you schedule work.
  • Privacy depends on your model setup. If you use a cloud model, relevant code is sent to that provider; local models can keep inference on your own hardware.