How Much Test Coverage Is Enough?

The answer is not a number. It is a strategy. And the strategy should be informed by what your code review findings tell you about where the real risk lives.

A coverage report dashboard alongside a risk matrix with sections highlighted in red and green

Every engineering team eventually has the test coverage conversation. Someone looks at the coverage report and asks the question: is this enough? The number stares back at them – 47%, or 72%, or 91% – and the team tries to decide whether they should feel good or bad about it. The conversation usually goes nowhere, because the question is wrong.

Test coverage as a percentage is a deeply misleading metric. It tells you how much of your code was executed during tests. It tells you nothing about whether the tests verified anything meaningful, whether the most important code paths are covered, or whether the tests would actually catch a real bug. Chasing a coverage number is one of the most common ways teams waste testing effort while leaving genuine risk areas unprotected.


Why 100% coverage is a bad goal

The appeal of 100% coverage is obvious. If every line of code is executed during tests, then every bug should be caught. The logic sounds airtight. It is not.

First, coverage measures execution, not verification. A test that calls a function and ignores the return value achieves coverage without testing anything. A test that asserts the output is not null achieves coverage while accepting almost any incorrect result. You can reach 100% coverage with tests that catch zero bugs.
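A minimal sketch of the difference, using a hypothetical `apply_discount` function (the names and values are illustrative): both tests below execute every line of the function, so both count identically towards coverage, but only one would catch a broken formula.

```python
def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    return round(price * (1 - percent / 100), 2)

def test_covers_but_verifies_nothing():
    # Executes the function, so coverage goes up, but the assertion
    # accepts almost any incorrect result.
    result = apply_discount(100.0, 20)
    assert result is not None

def test_actually_verifies():
    # Pins down expected values, so a broken formula fails the test.
    assert apply_discount(100.0, 20) == 80.0
    assert apply_discount(100.0, 0) == 100.0

test_covers_but_verifies_nothing()
test_actually_verifies()
```

If someone changed the formula to `price * (1 + percent / 100)`, the first test would still pass and the coverage report would not move.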

Second, the cost curve is exponential. Getting from 0% to 60% coverage is relatively straightforward – you write tests for your main functions and happy paths. Getting from 60% to 80% requires testing error paths, edge cases, and less-frequently-used features. Getting from 80% to 100% requires testing generated code, configuration boilerplate, trivial getters and setters, and code paths that exist only for defensive programming. Each percentage point costs more than the last, and the value of each point decreases.

Third, 100% coverage creates false confidence. When the dashboard shows a green 100%, teams stop thinking about testing strategy. They assume everything is covered. But coverage says nothing about the quality of the assertions, the relevance of the test scenarios, or whether the tests reflect real-world usage patterns. A codebase with 100% coverage and weak assertions is less safe than one with 70% coverage and strong, targeted tests on the critical paths.


Why 0% is obviously worse

None of this means tests do not matter. A codebase with no tests is a codebase where every change is a gamble. Refactoring is terrifying because there is no way to verify that the refactoring preserved behaviour. Bug fixes are risky because there is no regression suite to catch unintended side effects. Deployments are stressful because the only test environment is production.

Zero coverage is not a principled stand against metric-chasing. It is an engineering liability. The question is not whether to write tests. It is which tests to write.


The right question: what should be tested?

Instead of asking how much coverage is enough, ask what the consequences of a bug would be in each part of the system. The answer varies dramatically.

A bug in your authentication flow could expose user data, cause regulatory violations, and destroy customer trust. A bug in the colour of a tooltip is an annoyance. Both are bugs. They do not deserve equal testing effort.

Risk-based testing allocates effort based on the cost of failure, not the volume of code. It starts with a simple question for each module: if this breaks in production, how bad is it?


Risk-based testing: where to focus

Four categories of code deserve the most testing attention.

Critical paths. These are the flows that your business depends on. User registration, payment processing, data export, authentication, and authorisation. A failure in any of these directly affects revenue, compliance, or user trust. These paths should have thorough unit tests, integration tests, and ideally end-to-end tests. They deserve high coverage not because coverage is the goal, but because the cost of failure is extreme.

Complex logic. Any code with high cyclomatic complexity – deeply nested conditionals, state machines, calculation engines, parsing logic – is statistically more likely to contain bugs. The more branches a function has, the more possible execution paths exist, and the harder it is to reason about correctness by reading the code alone. Complex logic needs tests because human review cannot reliably verify all the paths.
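To make this concrete, here is a hypothetical branch-heavy function (a made-up shipping-fee rule) with one test case per execution path. With four paths there are four distinct behaviours to pin down, and it is easy to miss one when reviewing by eye:

```python
def shipping_fee(weight_kg: float, express: bool) -> float:
    # Three weight tiers plus an express multiplier: 4+ paths.
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    if weight_kg <= 1:
        base = 5.0
    elif weight_kg <= 10:
        base = 12.0
    else:
        base = 25.0
    return base * 2 if express else base

# One case per branch, including the tier boundaries themselves.
cases = [
    (0.5, False, 5.0),    # light tier
    (1.0, False, 5.0),    # boundary: exactly 1 kg
    (5.0, False, 12.0),   # middle tier
    (20.0, False, 25.0),  # heavy tier
    (5.0, True, 24.0),    # express multiplier path
]
for weight, express, expected in cases:
    assert shipping_fee(weight, express) == expected
```

The table of cases doubles as documentation of the tiers: if a reviewer disputes a boundary, the disagreement surfaces as a failing assertion rather than a production bug.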

Edge cases and boundary conditions. Empty arrays, null inputs, maximum values, concurrent access, timezone boundaries, Unicode handling. These are the scenarios that developers think about for three seconds and then decide to handle later. They are also the scenarios that cause production incidents at 2am on a Saturday. Edge case testing is disproportionately valuable because edge cases are disproportionately likely to fail.
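A short sketch of what boundary testing looks like in practice, using a hypothetical `median` function: the happy path takes one assertion, while the boundaries (single element, even-length input, empty input) are where the bugs hide.

```python
def median(values):
    if not values:
        raise ValueError("empty input")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Boundary conditions, not just the happy path.
assert median([3]) == 3                # single element
assert median([1, 2]) == 1.5           # even length: average of middle pair
assert median([2, 1, 3]) == 2          # unsorted input
try:
    median([])                         # empty input must fail loudly
    raise AssertionError("expected ValueError")
except ValueError:
    pass
```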

Integration points. Every place your code talks to an external system – a database, an API, a message queue, a file system – is a place where assumptions can be violated. The database might return rows in a different order than expected. The API might return a new field that breaks your parser. The file system might be read-only. Integration tests verify that your code handles real-world interactions, not just the idealised versions in your unit tests.
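The "new field breaks your parser" failure can be rehearsed cheaply. This sketch assumes a hypothetical external API returning a user payload; the test deliberately includes a field the parser has never seen, which is exactly what a provider-side change looks like from your side:

```python
import json

def parse_user(payload: str) -> dict:
    data = json.loads(payload)
    # Pick only the fields we need and ignore unknown ones, so a new
    # field in the provider's response does not break us.
    return {"id": data["id"], "email": data["email"]}

# Simulate the provider adding a field we have never seen.
response = '{"id": 7, "email": "a@example.com", "mfa_enabled": true}'
assert parse_user(response) == {"id": 7, "email": "a@example.com"}
```

A parser that instead iterated over every field and rejected unknown keys would pass unit tests against idealised fixtures and fail this one, which is the point of testing the integration assumption directly.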


Coverage as a floor, not a ceiling

Even though coverage makes a poor goal, it can still be a useful guardrail. The way to use it is as a floor, not a ceiling.

A coverage floor means: we will not let coverage drop below this number. If a new commit decreases overall coverage, it must either include new tests or the team must acknowledge that untested code is being added. The floor is not aspirational. It is a ratchet that prevents regression.

A reasonable floor for most projects is somewhere between 60% and 80%, depending on the nature of the codebase. A data processing pipeline with complex transformation logic should have higher coverage than a CRUD application with mostly straightforward database operations. The number matters less than the principle: coverage can go up, but it should not go down without a conversation.
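A minimal sketch of the ratchet as a CI step, assuming your test runner reports an overall percentage (the floor value and function names here are illustrative, not any tool's convention):

```python
FLOOR = 70.0  # the agreed team floor, not an aspirational target

def check_coverage(current: float, floor: float = FLOOR) -> None:
    """Fail the build if coverage has dropped below the floor."""
    if current < floor:
        raise SystemExit(
            f"coverage {current:.1f}% is below the floor of {floor:.1f}%: "
            "add tests, or discuss lowering the floor with the team"
        )

check_coverage(72.4)  # above the floor: passes silently
```

Many coverage tools support this natively, for example pytest-cov's `--cov-fail-under` flag, so in practice the "script" is often a single CI argument.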

What the floor should not do is create incentives to write low-value tests. If a developer needs to bump coverage by 2% to pass the build, and they write trivial tests on getter methods to achieve it, the floor is doing more harm than good. The floor should trigger a conversation, not a mechanical response.


Using code review findings to identify untested risk

One of the most effective ways to identify where tests are missing is to look at code review findings. When a code review – human or AI – identifies a potential bug, a security vulnerability, or a logic error, the natural follow-up question is: would our tests have caught this?

If the answer is no, you have found a gap in your test suite. Not a theoretical gap based on coverage percentages, but a concrete gap based on a real issue that exists in your code today. This is far more valuable than chasing a coverage number because it connects testing effort directly to actual risk.

The pattern works in reverse too. When a code review finds no issues in a module, and that module also has low coverage, it does not necessarily mean the module is well-tested. It might mean the review did not probe deeply enough. Or it might mean the module is genuinely simple and low-risk. The combination of review findings and coverage data gives you a much richer picture than either metric alone.


VibeRails testing gap detection

When VibeRails analyses a codebase, it identifies modules where the combination of complexity, risk level, and test coverage suggests a testing gap. A high-complexity module handling sensitive data with low test coverage is flagged – not because the coverage number is below a threshold, but because the risk profile does not match the testing investment.

This is different from a coverage tool that simply reports percentages. Coverage tools tell you what is tested. VibeRails findings tell you what should be tested but is not. The difference is that one requires you to interpret a number, while the other highlights specific areas where your testing strategy has gaps relative to the risk.

The finding includes the module, the risk factors (complexity, data sensitivity, change frequency), and the current coverage level. This gives teams the information they need to make informed decisions about where to invest their next round of testing effort.


Stop chasing numbers, start managing risk

The test coverage conversation is stuck because teams are trying to answer a quantitative question with a qualitative problem. There is no magic number. There is no universal threshold. The right amount of test coverage depends entirely on what the code does, how likely it is to break, and how bad the consequences are when it does.

Write thorough tests for the code that matters most. Use coverage as a floor to prevent regression, not as a target to optimise for. Let code review findings guide you to the gaps that matter. And stop feeling guilty about the getter methods you did not test – the time is better spent writing a proper integration test for your payment processing flow.


Limits and tradeoffs

  • Automated analysis can miss context. Treat findings as prompts for investigation, not verdicts.
  • False positives happen. Plan a quick triage pass before you schedule work.
  • Privacy depends on your model setup. If you use a cloud model, relevant code is sent to that provider; local models can keep inference on your own hardware.