There is a common assumption that AI-generated code needs less review than human-written code. The reasoning is intuitive: an AI that has been trained on millions of repositories should produce code that follows best practices, avoids common mistakes, and adheres to established patterns. If the code looks clean and the tests pass, why scrutinise it?
This reasoning is backwards. AI-generated code needs more review, not less, precisely because its failure modes are different from and harder to detect than those of human-written code. Human-written code tends to fail in obvious ways – syntax errors, missing imports, clear logical bugs. AI-generated code tends to fail in subtle ways that look correct on the surface.
The plausibility problem
Large language models are optimised to produce plausible output. In the context of code generation, this means the output looks like real, working code. The variable names are reasonable. The structure follows familiar patterns. The syntax is correct. It reads like code that a competent developer would write.
But plausibility is not correctness. A function that sorts a list might use a well-known algorithm, handle the empty case, and return the right type – but compare elements using the wrong field. A database query might have proper joins, correct table names, and valid SQL syntax – but filter on a condition that produces subtly wrong results for a specific category of input.
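A minimal sketch of the pattern, using a hypothetical Order type: the function below reads like a perfectly reasonable "newest first" sort, and it will pass casual testing because IDs usually correlate with creation order.

```python
from dataclasses import dataclass

@dataclass
class Order:
    id: int
    created_at: str   # ISO date
    amount: int

def most_recent_orders(orders):
    """Return orders newest first -- looks correct at a glance."""
    # Bug: sorts by id rather than created_at. IDs usually track
    # creation order, so simple fixtures will not expose this.
    return sorted(orders, key=lambda o: o.id, reverse=True)

orders = [
    Order(id=1, created_at="2024-06-01", amount=500),
    Order(id=2, created_at="2024-01-15", amount=300),  # backfilled later
]
# The 2024-06-01 order should come first; this returns the older one.
```

Everything about the code is plausible: the names, the use of `sorted`, the `reverse=True`. Only reading the `key` against the stated intent reveals the error.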
These errors are difficult to catch through casual reading because the code looks right. The structure is sound. The naming is clear. The only problem is that the logic does not do exactly what was intended, in a way that might not surface until a specific edge case is encountered in production.
Human-written code with a logical error usually has other signals that something is off. The developer who wrote it may have left a comment expressing uncertainty. The surrounding code may show evidence of iteration – commented-out alternatives, TODO notes. The structure may be slightly awkward in a way that hints at a struggle with the logic. AI-generated code has none of these signals. It is uniformly confident, regardless of whether it is correct.
Missing context about your system
When a developer on your team writes code, they bring context that an AI does not have. They know that the user table uses soft deletes, so queries need a WHERE deleted_at IS NULL filter. They know that the payment service returns amounts in pence, not pounds. They know that the configuration system has a caching layer that means changes do not take effect immediately.
An AI coding assistant does not know these things unless they are explicitly stated in the prompt or visible in the immediate context window. It generates code based on general patterns, not specific knowledge of your system's conventions, constraints, and assumptions.
This produces code that is generically correct but specifically wrong. The AI writes a user query that works perfectly – except it returns soft-deleted users. It writes a price calculation that is mathematically sound – except it treats the values as pounds when they are actually pence. It writes a configuration update that takes effect immediately – except it does not, because of the cache.
A human reviewer who knows the system catches these issues immediately. They are obvious to anyone with context. But they are invisible in the code itself, which is why they pass automated tests (which also lack this context) and why they persist until they cause problems in production.
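The pence-versus-pounds case can be made concrete. In this hypothetical sketch, prices arrive as integer pence (a convention the AI cannot see in the code itself), and both versions are "correct" in isolation:

```python
def order_total_generic(items):
    # Generically correct: sums prices and formats as currency.
    # Specifically wrong here: our (hypothetical) payment service
    # returns integer pence, so 2249 pence becomes "£2249.00".
    return f"£{sum(i['price'] for i in items):.2f}"

def order_total(items):
    # System-aware version: values are pence; convert once,
    # at the display boundary.
    total_pence = sum(i['price'] for i in items)
    return f"£{total_pence / 100:.2f}"

items = [{"price": 1999}, {"price": 250}]   # pence, by convention
```

Nothing in the data structure says "pence", so no test written without that knowledge will catch the hundredfold error.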
The edge case blind spot
AI coding assistants are remarkably good at generating the happy path. Given a description of what a function should do, they will produce a clean implementation that handles the primary use case correctly. What they frequently miss are the edge cases that a developer would consider through experience.
What happens when the input is null? What happens when the list is empty? What happens when the string contains Unicode characters outside the basic multilingual plane? What happens when the timestamp crosses a daylight saving time boundary? What happens when the file system is full? What happens when the network request times out?
An experienced developer in your domain knows which edge cases matter because they have encountered them before. They know that the timezone handling is critical because there was a production incident about it last year. They know that the empty list case needs special treatment because the downstream service cannot handle empty payloads. This knowledge comes from experience with the specific system, not from general programming knowledge.
AI-generated code handles edge cases when the training data included similar edge case handling, or when the prompt explicitly mentions them. It does not proactively consider edge cases that are specific to your system, your data, or your users. This is not a limitation that better models will overcome. It is a fundamental gap between general knowledge and domain-specific experience.
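The empty-payload case from above can be sketched as follows. The guard clauses are the domain-specific part: nothing in the function's signature hints that the (hypothetical) downstream service rejects empty payloads, so no amount of general training data would suggest them.

```python
def build_payload(records):
    """Serialise records for a downstream service.

    Assumption for this sketch: the downstream service cannot handle
    empty payloads, so we return None and the caller skips the call.
    This rule comes from operational experience, not from the code.
    """
    if records is None:
        raise ValueError("records must not be None")
    if not records:
        return None   # caller interprets None as "nothing to send"
    return {"count": len(records), "items": records}
```

A generated version would almost certainly serialise the empty list and send it, which is reasonable in general and wrong for this system.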
Pattern inconsistency
Every codebase has conventions. How errors are handled. How configuration is accessed. How logging is structured. How database transactions are managed. These conventions may not be formally documented, but they are consistent enough that a developer working in the codebase internalises them.
AI-generated code follows the patterns it has seen most frequently in its training data, which may not match the patterns in your codebase. The result is code that works but does not fit. It uses a different error handling approach. It accesses configuration through a different mechanism. It logs in a different format.
Any single inconsistency is minor. But over time, as more AI-generated code enters the codebase, the inconsistencies accumulate. The codebase gradually develops multiple competing patterns for the same concern. This makes the system harder to understand, harder to maintain, and harder for new developers to learn.
This is particularly insidious because each piece of AI-generated code is internally consistent. It follows a pattern. Just not the same pattern as the rest of the codebase. A reviewer who looks at the code in isolation might approve it. Only a reviewer who knows the codebase conventions will catch the inconsistency.
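As an illustration with hypothetical payment code, here are two error-handling conventions, each internally consistent. Neither is wrong on its own; the problem is a module that ends up containing both:

```python
# Convention already used throughout the (hypothetical) codebase:
# domain-specific exceptions raised to the caller.
class PaymentError(Exception):
    pass

def charge_existing_style(amount):
    if amount <= 0:
        raise PaymentError("amount must be positive")
    return {"status": "charged", "amount": amount}

# Convention an assistant might introduce instead: (ok, value)
# result tuples with string error codes. Perfectly coherent in
# isolation -- but now callers must handle two different styles.
def charge_generated_style(amount):
    if amount <= 0:
        return (False, "INVALID_AMOUNT")
    return (True, {"status": "charged", "amount": amount})
```

A reviewer seeing only the second function has nothing to object to. Only a reviewer who knows the module raises `PaymentError` everywhere else will flag it.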
The volume problem
AI coding assistants increase the rate at which code is produced. This is their primary value proposition: developers write more code faster. But the review process has not sped up correspondingly.
When a developer writes 200 lines of code in a day, reviewing that code is manageable. When the same developer produces 800 lines with AI assistance, the review workload quadruples. But the reviewer's available time has not changed. Something must give: either reviews take longer (creating the delay costs discussed elsewhere), reviews become shallower, or the review process is bypassed for some changes.
All three outcomes are problematic. Longer reviews delay the development pipeline. Shallower reviews miss the subtle issues that AI-generated code is prone to. Bypassed reviews allow unvetted code into the codebase. The increased production velocity that AI provides creates a review bottleneck that, if unaddressed, undermines the quality benefits of review.
This is the core tension. The same technology that makes it easier to write code makes it harder to review code, because there is simply more of it. Addressing this tension requires either scaling the review process – through AI-assisted review, automated baseline checks, or dedicated review capacity – or accepting that a larger portion of the codebase is entering production without adequate scrutiny.
What to look for when reviewing AI-generated code
Reviewing AI-generated code effectively requires a different emphasis from reviewing human-written code. Here is what to focus on.
Verify the logic, not just the structure. AI-generated code will have clean structure. Do not let that distract you. Read the logic line by line. Does this comparison use the right operator? Does this loop terminate correctly? Does this conditional cover all the cases it should?
Check system-specific constraints. Does the code respect your codebase's conventions? Does it handle your system's specific quirks – soft deletes, currency units, caching behaviour, timezone rules? These are the things the AI does not know about.
Test the edge cases explicitly. Do not assume the AI has handled them. Run through the edge cases that matter for your domain: null inputs, empty collections, boundary values, concurrent access, network failures. If the code does not handle them, it needs to.
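In practice this can mean writing the edge-case assertions out explicitly rather than eyeballing the code. A sketch, with a hypothetical `normalise_tags` function standing in for the code under review:

```python
def normalise_tags(tags):
    """Hypothetical function under review: lower-cases, trims,
    and de-duplicates a list of tags, preserving first-seen order."""
    if tags is None:
        return []
    seen, out = set(), []
    for t in tags:
        key = t.strip().lower()
        if key and key not in seen:
            seen.add(key)
            out.append(key)
    return out

# Run the edge cases rather than assuming they are handled:
assert normalise_tags(None) == []               # null input
assert normalise_tags([]) == []                 # empty collection
assert normalise_tags(["  ", ""]) == []         # whitespace-only values
assert normalise_tags(["DB", "db "]) == ["db"]  # duplicates after normalising
assert normalise_tags(["Łódź"]) == ["łódź"]     # non-ASCII survives lowering
```

Each assertion takes seconds to write and answers a question that reading the happy path never will.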
Verify consistency with existing patterns. Does the code use the same error handling approach as the rest of the module? The same logging format? The same configuration access pattern? Inconsistencies that seem minor in isolation become significant maintenance burdens at scale.
Question the dependencies. AI-generated code sometimes imports libraries or uses APIs that are not part of your project's dependency set. Occasionally, it references libraries that do not exist at all – a well-known hallucination pattern. Verify that every import resolves to a real, approved dependency.
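A cheap first check for hallucinated dependencies can be automated. This sketch (for Python code) parses a source string and reports top-level imports that do not resolve in the current environment; it is a triage tool, not a substitute for checking that each dependency is approved:

```python
import ast
import importlib.util

def unresolved_imports(source):
    """Return top-level imported module names that do not resolve
    in the current environment."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    # find_spec returns None for importable names that do not exist.
    return sorted(n for n in names if importlib.util.find_spec(n) is None)
```

Anything this flags either needs to be added to the dependency set deliberately or, if it does not exist at all, is a hallucination.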
Check for security implications. AI-generated code may not follow your security practices. Does it sanitise user input? Does it use parameterised queries? Does it handle authentication tokens correctly? Does it log sensitive data? Security review should be applied with particular rigour to AI-generated code because the AI does not know your threat model.
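The parameterised-query point is worth showing concretely. Using an in-memory SQLite database as a stand-in, the interpolated version is valid SQL and returns correct results for benign input, which is exactly why it survives casual review:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

def find_user_unsafe(name):
    # String interpolation: works for benign input, but
    # name = "x' OR '1'='1" returns every row.
    return conn.execute(
        f"SELECT id, name FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user(name):
    # Parameterised: the driver treats the input as a literal,
    # so injection attempts match nothing.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

Both functions pass a test that looks up "alice". Only the injection input distinguishes them, which is why security review cannot rely on the happy path.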
The paradox of AI-generated code quality
There is a paradox at the heart of AI code generation. The code looks good enough that developers trust it more than they should, but it lacks the domain-specific context that makes code truly correct. It occupies an uncanny valley of quality: too good to obviously need review, not good enough to safely skip review.
The appropriate response is not to reject AI code generation. It is to recognise that the review process must adapt to the characteristics of AI-generated code. The old review heuristics – scan for obvious bugs, check the structure, verify the tests pass – are insufficient. AI-generated code passes those checks easily. The issues are deeper: wrong logic that looks right, missing context that is not visible in the code, edge cases that were never considered, patterns that do not match the codebase.
Teams that are increasing their use of AI coding assistants need to simultaneously increase the rigour of their review process. Not decrease it. The faster code is produced, the more important it becomes to verify that the code is correct, consistent, and safe. AI code review tools can help by providing automated analysis that catches the systematic issues – pattern inconsistencies, security gaps, missing error handling – while human reviewers focus on the domain-specific validation that requires contextual knowledge.
The goal is not to slow down AI-assisted development. It is to ensure that the speed gain in code production does not come at the expense of code quality. And that requires treating AI-generated code not as pre-vetted output that needs a quick glance, but as unreviewed code that happens to look unusually clean.