Code Review Automation: What to Automate and What to Keep Human

Trying to automate all of code review is a mistake. Keeping it all manual is also a mistake. Here is a three-tier framework for getting the balance right.

[Figure: three-tier review pipeline diagram showing automation, assisted review, and human judgement]

The promise of automation in code review is compelling. Machines do not get tired. They do not have off days. They can analyse a million lines in the time it takes a human to read a hundred. So the obvious conclusion is: automate everything.

The obvious conclusion is wrong. Not because automation is bad, but because code review is not one activity. It is several different activities that happen to share a name, and they have different automation profiles.

Some parts of code review should be fully automated. Some should be automated with human oversight. And some should remain entirely human. Knowing which is which is the difference between a review process that works and one that generates noise, misses important problems, or both.


Tier 1: Automate completely

Some review tasks are mechanical, deterministic, and high-volume. They follow fixed rules. They have clear right and wrong answers. A human performing these tasks is wasting their attention on something a machine can do better.

Formatting and style. Indentation, bracket placement, line length, import ordering – these should never consume a second of human review time. Configure a formatter (Prettier, Black, gofmt, whatever your language provides), enforce it in CI, and never discuss tabs versus spaces in a code review again. If your team is still leaving comments about formatting in pull requests, you have a tooling problem, not a review problem.
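As a minimal sketch of why this tier needs no human in the loop: a formatting gate is just deterministic rules and an exit code. A real team would use Prettier, Black, or gofmt; the two toy rules below (tabs and line length) only stand in for the idea.

```python
import sys

# Toy stand-in for a real formatter gate -- deterministic rules, no judgement.
MAX_LINE_LENGTH = 100

def formatting_violations(source: str) -> list[str]:
    """Return one message per deterministic formatting violation."""
    violations = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if "\t" in line:
            violations.append(f"line {lineno}: tab character (use spaces)")
        if len(line) > MAX_LINE_LENGTH:
            violations.append(f"line {lineno}: exceeds {MAX_LINE_LENGTH} chars")
    return violations

def ci_gate(source: str) -> int:
    """Exit code for CI: non-zero blocks the merge before any human looks."""
    violations = formatting_violations(source)
    for message in violations:
        print(message, file=sys.stderr)
    return 1 if violations else 0
```

The point of the exit code is that the gate composes with any CI system: the merge is blocked mechanically, and no reviewer ever sees the violation.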

Linting. Unused variables, unreachable code, missing return statements, shadowed declarations. These are syntactic and semantic checks with deterministic answers. Linters catch them faster and more consistently than any human reviewer. Run them automatically and block merges on failures.
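To make the "deterministic answers" point concrete, here is a toy lint pass that flags names assigned but never read. Real linters (Ruff, pylint, go vet) do this per scope with far more nuance; the sketch just shows the check is mechanical.

```python
import ast

def unused_assignments(source: str) -> list[str]:
    """Flag module-level names that are assigned but never read (toy lint rule)."""
    tree = ast.parse(source)
    assigned: dict[str, int] = {}   # name -> first assignment line
    used: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                assigned.setdefault(node.id, node.lineno)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
    return [
        f"line {lineno}: '{name}' assigned but never used"
        for name, lineno in sorted(assigned.items(), key=lambda kv: kv[1])
        if name not in used
    ]
```

There is no ambiguity for a human to resolve here: either the name is read somewhere or it is not, which is exactly why the check belongs in Tier 1.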

Simple security patterns. Hardcoded secrets, SQL strings built with concatenation, HTTP used where HTTPS is required, known vulnerable dependency versions. These checks are pattern-based and can be fully automated with tools like Semgrep, Bandit, or dependency scanners. They do not require judgement. They require detection.
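A sketch of what "pattern-based" means in practice, in the spirit of a Semgrep or Bandit rule. The regexes here are illustrative, not a production ruleset -- real tools ship curated, tested patterns.

```python
import re

# Illustrative security patterns; a real scanner's rules are far more precise.
RULES = [
    (re.compile(r"""(password|secret|api_key)\s*=\s*['"][^'"]+['"]""", re.I),
     "possible hardcoded secret"),
    (re.compile(r"""execute\(\s*['"].*['"]\s*\+"""),
     "SQL built with string concatenation"),
    (re.compile(r"http://", re.I),
     "plain HTTP where HTTPS may be required"),
]

def scan(source: str) -> list[str]:
    """Return one finding per line that matches a known-bad pattern."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, message in RULES:
            if pattern.search(line):
                findings.append(f"line {lineno}: {message}")
    return findings
```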

Type checking. If your language has a type system, use it. TypeScript strict mode, mypy, the Go compiler – these catch entire categories of errors that human reviewers would need to trace manually. Type errors are not a matter of opinion.
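A small illustration of the category of error a type checker removes from human review. Running mypy in strict mode over this file would reject the commented-out call before any reviewer saw the diff -- no opinion involved.

```python
def apply_discount(price_cents: int, percent: float) -> int:
    """Return the discounted price, rounded down to whole cents."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return int(price_cents * (1 - percent / 100))

# apply_discount("19.99", 10)   # mypy would flag: argument 1 has type "str", expected "int"
total = apply_discount(1999, 10)
```

A human reviewer tracing this misuse would have to find the call site, find the signature, and compare them. The type checker does that for every call site, every time.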

The principle for Tier 1 is simple: if the check can be expressed as a deterministic rule with no exceptions, automate it completely and remove it from human review. Every minute a human spends checking formatting is a minute they are not spending on things that actually require human judgement.


Tier 2: Automate with human review

Some review tasks benefit from automation but cannot be fully delegated to machines. The automation generates findings, but a human must evaluate whether those findings are correct, relevant, and worth acting on.

AI-generated code quality findings. This is where tools like VibeRails operate. An AI model can read your codebase and identify patterns that suggest problems: inconsistent error handling across modules, duplicated business logic, overly complex functions, missing input validation in non-obvious places. These findings are often valuable, but they require human triage. The AI might flag a complex function that is intentionally complex because the domain demands it. It might identify duplicated code that was deliberately duplicated for isolation between services. Human context is needed to separate the true positives from the noise.

Dependency audits. Automated tools can identify outdated dependencies, known CVEs, and licence compatibility issues. But deciding what to do about them requires judgement. A critical CVE in a library you use in production is urgent. The same CVE in a library used only in a development script is not. An outdated dependency that works perfectly might not be worth the risk of upgrading if the new version introduces breaking changes.
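One way to encode that division of labour is to keep the machine's output (severity) and the human's input (context) as separate fields, and derive urgency from both. The names and thresholds below are invented for illustration, not taken from any real scanner.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """A dependency finding: the scanner fills in severity, humans fill in context."""
    package: str
    severity: str            # "critical", "high", "medium", "low" -- from the scanner
    used_in_production: bool  # from the humans who know where the code runs

def urgency(finding: Finding) -> str:
    """Hypothetical triage rule combining automated severity with human context."""
    if finding.severity == "critical" and finding.used_in_production:
        return "fix now"
    if finding.severity in ("critical", "high"):
        return "schedule this sprint"
    return "backlog"
```

The scanner alone cannot produce the `used_in_production` bit, which is exactly why this tier needs a human in the loop.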

Pattern detection across the codebase. Automated analysis can find inconsistencies – three different approaches to configuration loading, two incompatible logging strategies, authentication checks present in some endpoints but missing in others. Detecting these patterns at scale is something machines do well. But deciding which inconsistency matters, which pattern should be the standard, and what the migration path looks like requires human understanding of the system's history and trajectory.
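As a sketch of the "present in some endpoints but missing in others" case: a few lines of AST analysis can find every route handler that lacks an auth decorator. The decorator names (`route`, `require_auth`) are hypothetical stand-ins for whatever your framework and team actually use.

```python
import ast

def routes_missing_auth(source: str) -> list[str]:
    """Find functions decorated as routes but lacking an auth decorator."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.FunctionDef):
            continue
        names = set()
        for dec in node.decorator_list:
            target = dec.func if isinstance(dec, ast.Call) else dec
            if isinstance(target, ast.Attribute):
                names.add(target.attr)
            elif isinstance(target, ast.Name):
                names.add(target.id)
        if "route" in names and "require_auth" not in names:
            missing.append(node.name)
    return missing

SAMPLE = '''
@app.route("/dashboard")
@require_auth
def dashboard():
    pass

@app.route("/admin")
def admin_panel():
    pass
'''
```

The scan tells you `admin_panel` is unprotected. It cannot tell you whether that endpoint is meant to be public -- that answer lives with the humans.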

Performance analysis. Tools can flag potential performance issues – N+1 queries, unbounded list operations, missing indices. But whether these are actual problems depends on context. An N+1 query in a batch job that runs once a day at 3am is not the same as an N+1 query in a user-facing API called thousands of times per minute. The automation surfaces the issue. The human evaluates the impact.

The principle for Tier 2: automation does the finding, humans do the deciding. This division is efficient because finding problems at scale is what machines do best, and evaluating importance in context is what humans do best. Asking either side to do the other's job produces worse outcomes.


Tier 3: Keep fully human

Some aspects of code review cannot be meaningfully automated, even with human oversight on the output. They require understanding, experience, and contextual knowledge that machines do not possess.

Architecture decisions. Is this the right abstraction? Should this be a separate service or a module within the existing service? Does this data model support the features we plan to build next quarter? These questions require understanding the product roadmap, the team's capabilities, the system's operational constraints, and the trade-offs between competing approaches. No automated tool can answer them.

Business logic validation. Does this implementation correctly reflect the business rules? If the requirement says “users can cancel within 30 days,” does the code handle time zones correctly? Does it account for partial refunds? Does it match the legal terms? Validating business logic requires understanding the business – something that lives in conversations, documents, and people's heads, not in the codebase.

Team knowledge transfer. One of the most valuable functions of code review is that it spreads knowledge across the team. When a senior developer reviews a junior developer's code, the comments are not just about correctness. They are about approach, idiom, and the reasons behind patterns. This mentoring function cannot be automated. An AI can say that a function is too complex. Only a human colleague can explain why, suggest a better approach based on how the team does things, and help the author grow.

Naming and domain language. Is this variable name clear? Does it use the team's shared vocabulary? Does the module name reflect what the module actually does? Naming is one of the hardest problems in software because it requires compressing a complex concept into a word or phrase that other humans will understand. Automated tools can flag names that violate conventions (too short, wrong case). They cannot evaluate whether a name communicates the right concept.

Risk assessment. Some changes look small but carry large risk because they touch critical paths, payment processing, or data integrity constraints. Evaluating whether a change is risky requires understanding the system's failure modes, its history of incidents, and the downstream effects that are not visible in the diff. This kind of assessment is deeply human.


Why trying to automate everything fails

Teams that attempt to automate all of code review typically end up in one of two failure modes.

The first is alert fatigue. When every finding from every tool is presented as equally important, developers stop paying attention. The genuinely critical issues get buried alongside hundreds of style nitpicks and low-confidence suggestions. The team starts ignoring the automated output entirely, which means the automation is worse than useless – it consumes time without providing value.

The second is false confidence. When automated tools give a green light, teams assume the code is fine. But the tools only checked what they can check – the Tier 1 and some Tier 2 items. The architecture might be wrong. The business logic might be flawed. The code might be clear to a machine and incomprehensible to a human. A clean automated report does not mean the code is good. It means the code passed the automated checks.

The three-tier model avoids both failure modes. Tier 1 runs silently and blocks bad code before humans see it. Tier 2 surfaces findings for human triage, reducing the volume while preserving the signal. Tier 3 stays entirely with the humans who have the context to evaluate it.


Putting the tiers together

In practice, the three tiers operate in sequence. Tier 1 tools run in CI on every commit. They catch the mechanical issues and enforce baseline standards. By the time a pull request reaches a human reviewer, the formatting is correct, the linter is clean, the type checker is satisfied, and the obvious security patterns have already been screened out.

Tier 2 tools run periodically or on demand. VibeRails sits in this tier – it analyses your full codebase, generates findings about patterns, inconsistencies, and potential issues, and presents them for human triage. The team decides which findings are actionable and which are acceptable trade-offs.

Tier 3 happens in the pull request itself, in design reviews, and in pairing sessions. Human reviewers focus on the things that only humans can evaluate: the architecture, the business logic, the naming, the knowledge transfer. Because Tier 1 and Tier 2 have already handled the mechanical and pattern-based work, the human reviewers can spend their attention on the things that matter most.
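The sequencing described above can be sketched as a small orchestration function: Tier 1 checks block automatically, Tier 2 findings are queued for human triage rather than blocking, and Tier 3 stays in the pull request. The check and analyser shapes here are illustrative placeholders, not any particular tool's API.

```python
def run_pipeline(tier1_checks, tier2_analysers, change):
    """Run deterministic gates, then collect findings for human triage.

    tier1_checks: list of (name, predicate) pairs -- any failure blocks.
    tier2_analysers: callables returning lists of findings -- never blocking.
    """
    # Tier 1: deterministic gates, no human involved.
    failures = [name for name, check in tier1_checks if not check(change)]
    if failures:
        return {"status": "blocked", "failed_checks": failures}

    # Tier 2: automation finds, humans decide.
    findings = []
    for analyser in tier2_analysers:
        findings.extend(analyser(change))
    return {"status": "ready_for_human_review", "triage_queue": findings}
```

Tier 3 has no place in this function at all, which is the point: architecture, business logic, and naming never enter the automated pipeline.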

This is not a new idea. It is how well-run engineering teams have always operated. The difference now is that Tier 2 is becoming practical for the first time, thanks to AI tools that can read and reason about code at scale. The teams that figure out how to use that middle tier effectively – without either over-relying on it or ignoring it – will have a meaningful advantage.