If you search for “AI code review” you will find two kinds of content. Marketing pages that promise AI will find every bug in your codebase and fix it automatically. And academic papers that discuss transformer architectures and attention mechanisms. Neither tells you what actually happens when you point an AI code review tool at your repository.
This post is an honest, demystified explanation. What the technology does. What it is good at. What it is not good at. No exaggeration, no hand-waving, no hype.
The basic mechanism
At its core, AI code review works like this: your source code is sent to a large language model (LLM), along with instructions that tell the model what to look for. The model reads the code, analyses it against the specified categories, and returns structured findings. That is the entire mechanism. Everything else is orchestration.
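The whole mechanism fits in a few lines. In this sketch, `call_model` is a stub standing in for a real LLM API call (to Claude, for example) so the flow is visible end to end; the finding it returns is illustrative, not output from any actual model.

```python
import json

def call_model(prompt: str) -> str:
    # Stub for the LLM call. A real tool would send `prompt` to a
    # provider API here; the response shape is what matters.
    return json.dumps([{
        "category": "security",
        "severity": "high",
        "file": "app.py",
        "message": "User input reaches a SQL query without parameterisation",
    }])

def review(source: str, instructions: str) -> list:
    # The entire mechanism: instructions plus code in,
    # structured findings out. Everything else is orchestration.
    prompt = f"{instructions}\n\n---\n{source}"
    return json.loads(call_model(prompt))

findings = review(
    "cursor.execute('SELECT * FROM users WHERE id=' + user_id)",
    "Review this code for security issues. Return a JSON array of findings.",
)
```

Everything a real tool adds — chunking, context, aggregation, retries — wraps around this one round trip.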
The LLM is a neural network trained on enormous quantities of text, including billions of lines of source code in every major programming language. During training, it learned patterns: how code is typically structured, what common vulnerabilities look like, how error handling is conventionally implemented, what naming conventions are standard in different languages and frameworks. When it reads your code, it applies these learned patterns to identify where your code deviates from best practices, contains potential issues, or could be improved.
The instructions – often called a system prompt or review prompt – tell the model what categories to evaluate. A well-designed tool does not simply say “review this code.” It specifies categories: security, error handling, architecture, performance, consistency, maintainability. For each category, the prompt defines what constitutes an issue, what severity levels to use, and how to structure the output. The quality of the prompt is one of the biggest differentiators between AI code review tools.
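To make the difference concrete, here is a hypothetical sketch of category-specific prompt construction. The category definitions and the output schema are illustrative choices, not any particular tool's actual prompts.

```python
# Illustrative category definitions -- a real tool's would be far richer.
CATEGORIES = {
    "security": "injection vectors, missing access controls, hardcoded secrets",
    "error_handling": "swallowed exceptions, missing error propagation",
}
SEVERITIES = ("critical", "high", "medium", "low")

def build_prompt(category: str) -> str:
    # One focused prompt per category, with an explicit definition of
    # an issue and a fixed output schema -- rather than a generic
    # "review this code".
    return (
        f"You are a code reviewer. Evaluate ONLY this category: {category}.\n"
        f"An issue is any instance of: {CATEGORIES[category]}.\n"
        f"Rate each finding as one of: {', '.join(SEVERITIES)}.\n"
        "Return a JSON array of objects with keys: "
        "file, line, severity, message."
    )
```

The fixed schema is what lets the tool parse findings reliably instead of scraping free-form prose out of the model's response.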
Orchestration: making it work at scale
An LLM has a context window – a limit on how much text it can process at once. Current models can handle anywhere from tens of thousands to a couple of hundred thousand tokens, which is enough for individual files or small modules, but not for an entire codebase of hundreds of thousands of lines.
This is where orchestration comes in. A code review tool does not send your entire codebase to the model in a single request. It breaks the codebase into manageable chunks – individual files, groups of related files, or logical modules – and sends each chunk to the model separately. Some tools also include contextual information with each chunk: project structure, configuration files, shared type definitions, or related files that provide context for the file being reviewed.
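A minimal sketch of the chunking step, assuming a character count as a crude stand-in for the model's token budget (real tools use proper tokenisers and smarter grouping by module):

```python
def chunk_files(files: dict, max_chars: int) -> list:
    """Group files into chunks that fit a rough size budget.

    `max_chars` is a crude proxy for the model's context window;
    each returned chunk is a list of file paths to review together.
    """
    chunks, current, size = [], [], 0
    for path, text in files.items():
        if current and size + len(text) > max_chars:
            chunks.append(current)  # budget exceeded: start a new chunk
            current, size = [], 0
        current.append(path)
        size += len(text)
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then goes to the model in its own request, optionally bundled with shared context such as type definitions or configuration files.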
The findings from each chunk are then aggregated, deduplicated, and sometimes re-evaluated. If the model flags the same pattern in twenty files, the tool might consolidate those into a single finding about a systemic issue rather than reporting twenty separate instances.
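The consolidation step can be sketched like this — a simplified version that groups findings by category and message text, assuming the simple finding schema used above (real tools match findings more fuzzily than exact string equality):

```python
from collections import defaultdict

def consolidate(findings: list) -> list:
    """Merge identical findings from different files into one
    systemic finding instead of N separate instances."""
    by_key = defaultdict(list)
    for f in findings:
        by_key[(f["category"], f["message"])].append(f["file"])
    out = []
    for (category, message), files in by_key.items():
        files = sorted(set(files))
        if len(files) > 1:
            # Same pattern in many files: report it once, as systemic.
            message = f"Systemic issue in {len(files)} files: {message}"
        out.append({"category": category, "message": message, "files": files})
    return out
```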
Orchestration also handles practical concerns: rate limiting, error recovery, progress tracking, and cost management. Reviewing a large codebase can require hundreds of API calls, each of which costs money and takes time. A good tool manages this process transparently so the user sees a single coherent report rather than the mechanical complexity of the underlying requests.
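The rate-limiting and error-recovery side usually comes down to retrying with exponential backoff. A sketch, assuming the provider signals rate limiting by raising an exception (here generically `RuntimeError`; a real client would catch the provider's specific error type):

```python
import time

def call_with_retry(request, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry a callable with exponential backoff -- a common way to
    absorb provider rate limits across hundreds of review requests."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 1s, 2s, 4s, ... before trying again.
            time.sleep(base_delay * 2 ** attempt)
```

Multiply a delay like this across hundreds of chunked requests and it becomes clear why a review of a large codebase takes minutes rather than seconds.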
What AI code review can do
AI code review is genuinely good at several categories of analysis. These are not theoretical capabilities – they are things current tools reliably detect in real codebases.
Pattern detection. LLMs excel at recognising patterns they have seen during training. This includes security anti-patterns (SQL injection vectors, cross-site scripting vulnerabilities, insecure cryptographic usage), error handling anti-patterns (swallowed exceptions, missing error propagation, inconsistent error responses), and code organisation anti-patterns (God classes, circular dependencies, excessive coupling).
Consistency checking. An LLM can compare how different parts of a codebase handle the same concern and flag inconsistencies. If nine out of ten API endpoints validate input in a certain way and the tenth does not, the model will notice. If the codebase uses two different logging libraries or three different approaches to configuration, the model will flag that. This kind of cross-file consistency analysis is something human reviewers struggle with because it requires holding the entire codebase in working memory.
Security scanning. LLMs can trace data flow through code and identify points where untrusted input reaches sensitive operations without validation or sanitisation. They can recognise authentication and authorisation patterns and flag endpoints that are missing access controls. They can identify hardcoded secrets, insecure defaults, and permissions configurations that are too broad.
Architectural analysis. At the module level, AI review can evaluate separation of concerns, dependency direction, abstraction quality, and interface design. It can identify modules that have too many responsibilities, layers that bypass the intended architecture, and abstractions that do not abstract anything useful.
Documentation and naming. LLMs can evaluate whether function names accurately describe what the function does, whether comments are accurate relative to the code, and whether the code is self-documenting or requires explanation that is missing.
What AI code review cannot do
The honest assessment requires acknowledging the limitations. These are not minor caveats. They are fundamental constraints that every user should understand.
It cannot verify runtime behaviour. An LLM reads code statically. It does not execute the code, run tests, or observe the system in operation. It can predict what the code will do based on its understanding of the language semantics, but it cannot confirm that prediction. A function that looks correct in static analysis might fail at runtime due to environmental factors, timing issues, or interactions with external systems that are not visible in the source code.
It cannot understand your business context. The model does not know your business requirements, your user workflows, your regulatory constraints, or your strategic priorities. It can identify that a function does X, but it cannot evaluate whether X is the right thing to do for your specific business. If your discount logic is supposed to cap at 30% but the code allows 50%, the model might not flag it because the code is technically valid – it just does not match a requirement the model has never seen.
It cannot guarantee completeness. An LLM might miss issues. It might focus on the obvious problems in a file and overlook a subtler one. It might misunderstand a complex interaction between modules. It might produce a finding that is technically correct but irrelevant in your specific context. AI code review is probabilistic, not deterministic. Running the same review twice might produce slightly different results. This is fundamentally different from a linter, which will always produce the same output for the same input.
It can hallucinate. LLMs sometimes generate findings that are factually wrong. They might reference a function that does not exist, describe a vulnerability that is not present, or explain the code incorrectly. The rate of hallucination is lower in code review than in general-purpose tasks – because the code provides strong grounding for the model's analysis – but it is not zero. Every AI code review finding should be treated as a suggestion that requires human verification, not as a fact.
It cannot replace human judgement. The most important decisions in software engineering – whether to build or buy, whether to refactor or rewrite, whether a particular trade-off is acceptable given the project's constraints – require human judgement informed by domain expertise, business context, and strategic thinking. AI code review provides information. Humans make decisions.
How different tools differ
All AI code review tools use the same fundamental mechanism: send code to an LLM, get findings back. The differences are in the implementation details, and those details matter.
Scope. Some tools review individual pull requests. Others review entire codebases. PR-level review is useful for catching issues in new code. Codebase-level review is necessary for detecting systemic patterns, architectural inconsistencies, and accumulated debt that no single PR introduced.
Prompt quality. The instructions given to the model determine what it looks for and how it reports findings. Some tools use generic prompts that produce generic output. Others use carefully designed, category-specific prompts that produce structured, actionable findings. Prompt design is largely invisible to the user, but it is one of the biggest factors in the quality of the output.
Pricing model. Most AI code review tools charge per seat per month, which includes the cost of the model API calls. This bundles the orchestration software with the AI compute cost. The alternative is BYOK (Bring Your Own Key), where you use your own API subscription and the tool only charges for the software itself. BYOK gives you more cost transparency and avoids paying a markup on API costs you are already incurring.
Deployment model. Some tools are cloud-based: your code is uploaded to the vendor's servers for analysis. Others run locally: the tool runs on your machine and sends code directly to the AI provider without a vendor intermediary. For teams with data sensitivity requirements, the deployment model is often the deciding factor.
VibeRails: a transparent implementation
VibeRails takes a deliberately transparent approach to AI code review.
It uses the BYOK model. You provide your own Claude or Codex API key. VibeRails orchestrates the review process – breaking the codebase into chunks, sending each to the model with category-specific prompts, aggregating and structuring the findings – but the AI analysis runs on your own subscription. You see exactly what provider is being used, you control the cost, and you are not paying a markup on API calls.
It runs as a desktop application. Your code is read from your local file system and sent directly to the AI provider. It does not pass through VibeRails servers. For teams that need to keep their source code within controlled boundaries, this is not just a convenience – it is a requirement.
The findings are structured into categories with severity levels, specific file locations, and explanations. The output format is designed for triage: you can quickly identify the critical issues, plan the medium-term improvements, and file the low-priority observations for later.
The pricing is per developer: $299 for a lifetime licence, or $19 per month. This reflects the BYOK model: since VibeRails does not pay for your API usage, there is no AI markup in the licence cost. Each developer gets their own licence for their machine, and can use it as often as they want.
The honest summary
AI code review is a real technology with real capabilities and real limitations. It works by sending your code to an LLM, which analyses it against trained patterns and returns structured findings. It is good at pattern detection, consistency checking, security scanning, and architectural analysis. It cannot verify runtime behaviour, understand your business context, guarantee completeness, or replace human judgement.
Used correctly – as a supplement to human review, not a replacement for it – AI code review provides a consistent, scalable analysis layer that catches categories of issues that would otherwise go unnoticed until they cause problems. Used incorrectly – as an oracle that is always right, or as a substitute for thinking – it creates a different kind of false confidence.
The technology is valuable. The hype around it is not. Understanding the difference is the first step to using it effectively.