Code Review Patterns for Microservices Communication

Each service passes its own review. The communication between them passes nobody's review. That is where the real failures hide.

Microservices architecture distributes your system across independent services. It also distributes the problems. Each service has its own repository, its own team, and its own review process. Within a single service, code quality can be excellent. But the communication between services – the API contracts, the error propagation, the retry logic, the consistency guarantees – often receives no structured review at all.

This is where the most damaging failures occur. Not inside a service, but between services. A retry storm that takes down a downstream dependency. An API contract change that breaks three consumers silently. An eventual consistency bug that corrupts data across two services over the course of hours before anyone notices.

Reviewing microservices in isolation is necessary but not sufficient. The communication patterns between services deserve equal scrutiny.


API contract drift

API contracts are the agreements between services about what data they exchange and in what format. In theory, these contracts are well-defined and versioned. In practice, they drift.

Drift happens gradually. A producer adds an optional field to a response. A consumer starts depending on that field without the producer knowing. Another consumer ignores a newly required field in a request. The contract that was once clear and shared becomes a set of assumptions that differ between producer and consumer.

Reviewing a single service cannot catch contract drift. The producer's code looks correct – it produces a valid response. The consumer's code looks correct – it parses the expected response. The mismatch only becomes visible when you look at both sides simultaneously.

What to look for in review: compare the producer's response schema with the consumer's parsing logic. Check whether optional fields on the producer side are treated as required on the consumer side. Verify that enum values match between producer and consumer. Look for hardcoded assumptions about response structure that bypass schema validation.


Inconsistent error propagation

When Service A calls Service B and Service B returns an error, what does Service A do? The answer varies wildly across most microservices architectures, even within the same system.

Some services swallow errors and return a degraded response. Others propagate the error directly to the caller, leaking internal details. Some retry immediately. Others fail fast. Some return a generic 500 error regardless of the downstream failure. Others attempt to map downstream errors to meaningful responses but get the mapping wrong.

Inconsistent error propagation creates unpredictable behaviour for callers. A client calling the API gateway cannot reason about what an error means if different services behind the gateway handle errors in fundamentally different ways. Is a 503 transient? It depends on which service returned it. Is a 400 the client's fault? It might be the client's fault, or it might be a downstream service returning a 400 for an internal misconfiguration that got propagated upward.

What to look for in review: trace the error path from the deepest downstream service back to the client. Does each service in the chain handle errors consistently? Are downstream errors translated into meaningful responses at each layer? Are internal error details stripped before reaching external callers? Is there a shared error model, or does each service invent its own?
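
A shared error model can be sketched as a single translation function applied at every service boundary. The category names, status sets, and mapping rules below are illustrative assumptions:

```python
# Sketch of a shared error model: each service maps downstream HTTP
# failures into a small, agreed set of outward-facing categories,
# stripping internal details before they reach external callers.

TRANSIENT = {502, 503, 504}                # caller may safely retry
DOWNSTREAM_CLIENT_ERROR = {400, 401, 403, 404, 422}

def translate_downstream_error(status: int, internal_detail: str) -> dict:
    """Map a downstream status to a response the caller can reason about.

    Note: internal_detail is logged elsewhere, never returned."""
    if status in TRANSIENT:
        return {"status": 503, "code": "UPSTREAM_UNAVAILABLE", "retryable": True}
    if status in DOWNSTREAM_CLIENT_ERROR:
        # A downstream 4xx means WE misused the dependency -- it is not
        # the external client's fault, so do not propagate it as a 4xx.
        return {"status": 500, "code": "INTERNAL", "retryable": False}
    return {"status": 500, "code": "INTERNAL", "retryable": False}
```

The key property is that a 503 from any service behind the gateway now means the same thing: transient, retryable, and free of internal detail.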


Retry storm risks

Retries are essential in distributed systems. Network calls fail transiently, and retrying often succeeds. But uncoordinated retries across multiple layers create retry storms that amplify failures instead of recovering from them.

The pattern is straightforward. Service A retries three times on failure. Service A calls Service B, which also retries three times on failure. Service B calls Service C. If Service C is under load and responding slowly, Service B sends 3 requests to Service C per original request. Service A sends 3 requests to Service B, each of which sends 3 requests to Service C. One user request becomes 9 requests to Service C. Scale this across multiple callers and you have a retry storm that converts a minor latency spike into a total outage.
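
The multiplication above is easy to compute for any chain. A minimal sketch, with illustrative attempt counts:

```python
# Worst-case request amplification down a call chain: the attempt
# counts at each layer multiply. Attempt counts here are illustrative.

from math import prod

def amplification(attempts_per_layer: list[int]) -> int:
    """Requests reaching the deepest service per one original request."""
    return prod(attempts_per_layer)

# A -> B -> C, where A and B each make up to 3 attempts:
assert amplification([3, 3]) == 9
# Add a retrying gateway (2 attempts) and one more 3-attempt layer:
assert amplification([2, 3, 3, 3]) == 54
```

Each individual number is reasonable; the product is what takes down Service C.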

Reviewed within a single service, the retry logic looks fine. Three retries with exponential backoff is a reasonable policy. The problem is that every service in the call chain applies the same reasonable policy, and the multiplicative effect is unreasonable.

What to look for in review: map the retry policies at each layer of the call chain. Calculate the worst-case amplification factor. Check whether services use deadlines or timeouts that cascade properly – if the top-level request has a 30-second timeout, downstream retries should not collectively exceed that budget. Look for retry-on-timeout patterns without jitter, which cause synchronised retry waves.


Circuit breaker gaps

Circuit breakers prevent a failing downstream service from bringing down its callers. When a circuit breaker opens, the caller fails fast instead of waiting for a timeout, preserving its own resources and giving the downstream service time to recover.

The problem is not that teams do not know about circuit breakers. Most teams do. The problem is that circuit breakers are applied inconsistently. Service A has a circuit breaker on its call to Service B, but not on its call to Service C. Service D has circuit breakers on all external calls but not on its database connection. The message queue consumer has no circuit breaker at all.

Inconsistent circuit breaker coverage means that the service that lacks a circuit breaker becomes the weak link. It accumulates connections to a failing dependency, exhausts its connection pool, and then fails itself – cascading the failure to its own callers.

What to look for in review: catalogue every outbound call each service makes – HTTP calls, database connections, message queue operations, cache lookups. For each outbound call, verify that there is a circuit breaker or equivalent protection. Check that circuit breaker thresholds are calibrated to the dependency's SLA, not just set to defaults. Verify that the fallback behaviour when a circuit is open is tested and sensible.
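
To make the review checklist concrete, here is a minimal circuit breaker sketch. The threshold and cooldown values are illustrative defaults, exactly the kind of numbers that review should check against the dependency's actual SLA:

```python
# Minimal circuit breaker: after N consecutive failures the circuit
# opens and calls fail fast; after a cooldown, one trial call is
# allowed through (half-open). Defaults here are illustrative.

import time

class CircuitOpen(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise CircuitOpen("failing fast; dependency marked unhealthy")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

In review, the question is not whether this class exists somewhere in the codebase, but whether every outbound call in the catalogue is actually wrapped by it (or an equivalent).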


Eventual consistency bugs

Microservices that maintain their own data stores achieve consistency through events, sagas, or choreography. This means there are windows – sometimes seconds, sometimes minutes – where different services have different views of the same data. During these windows, operations can produce incorrect results.

A common example: a user updates their email address in the user service. The order service has not yet received the event. An order confirmation is sent to the old email address. The user never receives it. No service did anything wrong individually. The bug is in the timing between them.

Eventual consistency bugs are particularly difficult to catch in review because they require reasoning about time and ordering across multiple codebases. Within a single service, the code handles events correctly. The bug emerges from the interaction between services during the consistency window.

What to look for in review: identify operations that depend on data from multiple services. For each operation, determine what happens if the data is temporarily inconsistent. Check whether there are compensating actions or idempotency guarantees for operations that may execute during a consistency window. Look for read-after-write patterns that assume immediate consistency across service boundaries.
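
One of the guarantees worth checking for is an idempotent, version-aware event handler. A minimal sketch, where the event shape and in-memory store are illustrative assumptions:

```python
# Idempotent event handler: re-delivered or out-of-order events during
# a consistency window must not corrupt state. Versions let the
# consumer discard duplicates and stale updates safely.

def apply_email_changed(store: dict, event: dict) -> None:
    """Apply an email-changed event at most once, ignoring stale versions."""
    user = store.setdefault(event["user_id"], {"email": None, "version": 0})
    if event["version"] <= user["version"]:
        return  # duplicate or out-of-order delivery: safe to ignore
    user["email"] = event["email"]
    user["version"] = event["version"]
```

A handler without the version guard passes review in isolation; the bug only appears when the broker redelivers or reorders events between services.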


Distributed transaction anti-patterns

Some operations need to succeed or fail atomically across multiple services. Debit one account and credit another. Reserve inventory and charge the customer. Create a user and send a welcome email.

The temptation is to simulate distributed transactions by making sequential calls and hoping they all succeed. This is the most common distributed transaction anti-pattern: the implicit two-phase commit that is not actually a two-phase commit.

The pattern looks like this: call Service A to debit the account, then call Service B to credit the recipient. If Service B fails, call Service A to reverse the debit. But what if the reversal call to Service A also fails? Now you have debited an account without crediting the recipient and without reversing the debit. The system is in an inconsistent state that requires manual intervention.

What to look for in review: identify operations that span multiple services and require atomicity. Check whether there is a saga pattern or equivalent coordination mechanism. Verify that compensating actions exist for every step that can fail. Ensure that compensating actions are idempotent – they should be safe to retry. Look for fire-and-forget patterns in operations that require confirmation.
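
The saga coordination mentioned above can be sketched as pairing every action with a compensation and rolling back in reverse order on failure. The step structure and error handling here are illustrative assumptions:

```python
# Sketch of an orchestrated saga: each step pairs an action with an
# idempotent compensation. On failure, completed steps are compensated
# in reverse order.

def run_saga(steps):
    """steps: list of (action, compensate) pairs of callables."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            try:
                compensate()  # must be idempotent: safe to retry
            except Exception:
                # A real system persists failed compensations and
                # retries them, rather than giving up here.
                pass
        raise
```

Note what this sketch does not solve: if a compensation itself fails, the system still needs durable retry. That gap is precisely the "reversal call also fails" scenario described above, and review should confirm it is handled.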


Why isolated review misses these patterns

All of the patterns above share a common characteristic: they are invisible when you review a single service in isolation. Contract drift requires comparing producer and consumer. Retry storms require tracing the call chain. Consistency bugs require reasoning about timing across services.

PR review, by design, operates at the scope of a single change to a single repository. It is excellent for catching bugs within a service. It is structurally unable to catch bugs between services.

This is not a criticism of PR review. It is a recognition of its scope. PR review and cross-service review serve different purposes. Both are necessary. Most teams only do the first.


Full-codebase review across service boundaries

Catching cross-service issues requires a review approach that can see across service boundaries. This means scanning multiple services simultaneously and reasoning about their interactions.

VibeRails performs full-codebase scans that can include multiple services in a single review. Because it reads the entire codebase rather than individual changes, it can identify patterns that span service boundaries: mismatched API contracts, inconsistent error handling strategies, retry policies that multiply across call chains, and missing circuit breakers on critical dependencies.

This does not replace PR review within each service. It complements it by covering the territory that PR review cannot reach. Your service-level review ensures each service is well-built. Cross-service review ensures they work well together.

The most dangerous bugs in a microservices architecture are not the ones inside a service. They are the ones between services – in the communication patterns that nobody reviews because they do not belong to any single team's repository.


Limits and tradeoffs

  • Automated cross-service review can miss context. Treat findings as prompts for investigation, not verdicts.
  • False positives happen. Plan a quick triage pass before you schedule work.
  • Privacy depends on your model setup. If you use a cloud model, relevant code is sent to that provider; local models can keep inference on your own hardware.