Local AI Code Review – Run Models on Your Own Hardware

Run full-codebase AI code review on hardware you control. VibeRails orchestrates models through the Claude Code CLI, so the workflow stays the same whether inference runs on your desktop GPU, a Mac Studio, or a network-isolated cloud VPC.

What local AI code review means

Local AI code review means running an AI model on hardware you control – your workstation, a server in your rack, or a GPU instance in your own cloud VPC – instead of sending code to a third-party API endpoint over the internet. The model processes your source code, produces findings, and returns results without any data leaving your network boundary.

Before diving into the setup, it helps to understand a few terms that come up frequently when working with local models:

  • Quantization – reducing the numerical precision of a model's weights to use less memory. A model trained at 16-bit floating point (FP16) can often be quantized to 4-bit (Q4), which typically reduces weight memory to roughly one quarter of FP16. Q4 is a common sweet spot for local deployments: it dramatically reduces memory use with a quality trade-off that is often acceptable for code review (but should be tested on your codebase).
  • Tokens per second (tok/s) – the generation speed of the model. This measures how fast the model produces output text. For code review, you typically need the model to generate several thousand tokens per file. Higher tok/s means faster reviews, but as we will discuss, speed is less important than you might think for batch review workflows.
  • MoE (Mixture of Experts) – a model architecture where only a fraction of the total parameters are active for any given token. A model with 200 billion total parameters but a MoE architecture might only activate 10 billion parameters per token, meaning it needs far less compute and VRAM than its total parameter count suggests. MoE models offer strong capability at lower hardware requirements.
  • VRAM – the memory on your GPU (or unified memory on Apple Silicon). The model weights must fit in VRAM to run inference. If the model is larger than your available VRAM, it will either fail to load or spill to system RAM, which is dramatically slower. VRAM is the primary constraint for local model selection.
  • Prefill vs generation – two distinct phases of model inference. Prefill is reading and processing the input (your source code). Generation is producing the output (the review findings). For code review workloads, inputs can be large; depending on your model and hardware, either phase can dominate wall-clock time.
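To make the quantization and VRAM numbers concrete, here is a back-of-envelope weight-memory estimate. The 30B parameter count and the ~20% runtime overhead are illustrative assumptions, not any specific model's figures:

```shell
# Rough weight-memory estimate for a hypothetical 30B-parameter model.
# FP16 stores 2 bytes per parameter; Q4 stores ~0.5 bytes per parameter.
PARAMS_B=30
FP16_GB=$((PARAMS_B * 2))        # 2 bytes/param at FP16
Q4_GB=$((PARAMS_B / 2))          # ~0.5 bytes/param at Q4
Q4_TOTAL_GB=$((Q4_GB * 12 / 10)) # add ~20% for KV cache and runtime buffers
echo "FP16 weights:  ~${FP16_GB} GB"
echo "Q4 weights:    ~${Q4_GB} GB"
echo "Q4 + overhead: ~${Q4_TOTAL_GB} GB"
```

The last number is the one to compare against your VRAM: a model whose FP16 weights would never fit on a 24 GB card can land comfortably under budget at Q4.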

The core value proposition is straightforward: VibeRails combined with a local model can let you run AI-assisted code review without sending code to a third-party AI API. In practice, this works by routing Claude Code's API traffic to an Anthropic-compatible endpoint you control (for example, an Ollama server on localhost, or a model server inside a private VPC). From VibeRails' perspective, the review workflow is the same; where the code goes is determined by where the CLI sends requests and by your network egress controls.

How it works technically

The data flow for local AI code review through VibeRails has five steps, all of which happen within your network:

  1. VibeRails spawns the Claude Code CLI – when you start a review, VibeRails launches the claude CLI process with the review prompt and your project files, exactly as it would for a cloud API review.
  2. The CLI reads the ANTHROPIC_BASE_URL environment variable – instead of sending requests to api.anthropic.com, the CLI sends them to your local model server (for example, http://localhost:11434 for Ollama).
  3. The local model server runs inference – Ollama, vLLM, or llama.cpp receives the request and runs the model on your local GPU. The model reads your source code, reasons about it, and produces review findings.
  4. The CLI receives the response – the model's output streams back to the CLI over localhost. No network traffic leaves your machine.
  5. VibeRails processes the structured output – VibeRails consumes Claude Code CLI's stream-json output. As long as you're using the same CLI, the output format VibeRails expects stays consistent even if the CLI is routing requests to a different endpoint.
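The wire format in step 2 is the Anthropic Messages API. The sketch below builds a minimal request body and prints it to show its shape; the model name is a placeholder, and the /v1/messages route in the comment is an assumption — check your server's documentation for the exact endpoint it exposes:

```shell
# Construct a minimal Anthropic-style Messages API request body.
# Printed here to show the shape; in practice the Claude Code CLI
# builds and sends this for you.
BODY=$(cat <<'EOF'
{
  "model": "qwen3-coder",
  "max_tokens": 1024,
  "messages": [
    {"role": "user", "content": "Review this function for bugs."}
  ]
}
EOF
)
echo "$BODY"
# To send it manually (assuming an Anthropic-compatible /v1/messages route):
#   curl -s "$ANTHROPIC_BASE_URL/v1/messages" \
#     -H "x-api-key: $ANTHROPIC_AUTH_TOKEN" \
#     -H "content-type: application/json" \
#     -d "$BODY"
```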

A typical local setup is a small set of environment variables (exact requirements can vary by CLI version and server, but this works with Ollama's Anthropic-compatible endpoint):

export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"

ANTHROPIC_BASE_URL tells the CLI where to send requests. ANTHROPIC_AUTH_TOKEN is required by Claude Code even when the endpoint is local; Ollama accepts any value here. If you're using a proxy or a hosted Anthropic-compatible endpoint, this token may be real authentication.
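Before starting a long review, a small preflight check can catch a missing or malformed configuration. This is a sketch: it validates only the environment variables, not the server behind them.

```shell
# Preflight: verify the routing variables are set before starting a review.
check_local_env() {
  if [ -z "$ANTHROPIC_BASE_URL" ]; then
    echo "ANTHROPIC_BASE_URL is unset; requests would go to the default cloud endpoint"
    return 1
  fi
  case "$ANTHROPIC_BASE_URL" in
    http://*|https://*) ;;  # looks like a URL
    *) echo "ANTHROPIC_BASE_URL does not look like a URL: $ANTHROPIC_BASE_URL"
       return 1 ;;
  esac
  if [ -z "$ANTHROPIC_AUTH_TOKEN" ]; then
    echo "ANTHROPIC_AUTH_TOKEN is unset; Claude Code requires a value even locally"
    return 1
  fi
  echo "local routing looks configured: $ANTHROPIC_BASE_URL"
}

ANTHROPIC_BASE_URL="http://localhost:11434"
ANTHROPIC_AUTH_TOKEN="ollama"
check_local_env
```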

For a current, working reference configuration (including model names), see Ollama's Claude Code guide.

Model-name compatibility note. Claude Code passes a model identifier in each request. Your local Anthropic-compatible server must accept that identifier (or map it to the local model you want to run). In VibeRails, you can set a Custom Claude Model ID in Settings to pass a local model name directly (for example qwen3-coder). If your local server is strict about accepted model IDs, use a compatibility proxy (for example, LiteLLM) to map incoming model IDs to your local model.
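As one illustration of the proxy approach, a LiteLLM configuration can map an incoming Claude-style model ID onto a local Ollama model. Both model names below are placeholders, and the schema is sketched from LiteLLM's proxy config format; consult LiteLLM's documentation for the current syntax:

```shell
# Write a minimal LiteLLM proxy config mapping an incoming model ID
# to a local Ollama model. Names here are illustrative placeholders.
cat > litellm_config.yaml <<'EOF'
model_list:
  - model_name: claude-sonnet        # ID the CLI sends (placeholder)
    litellm_params:
      model: ollama/qwen3-coder      # local model to actually run
      api_base: http://localhost:11434
EOF
echo "wrote litellm_config.yaml"
# Then run the proxy (assumes litellm is installed):
#   litellm --config litellm_config.yaml
```

With the proxy in front, Claude Code can keep sending whatever model ID it prefers while inference stays on the model you chose.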

Recommended models (February 2026)

The local model landscape changes rapidly. Rather than hard-coding benchmark scores (which change with evaluation scaffolds and new releases), here is a practical way to pick a starting model for code review:

  • Fastest start: pick a modern coding model that runs comfortably on your hardware and supports long context.
  • Better cross-file reasoning: use a larger model (or a mixture-of-experts coding model) if you have the VRAM and you care about architectural findings that depend on multiple modules.
  • Operational simplicity: prefer models that behave consistently under tool constraints (structured output, low temperature, predictable formatting). For VibeRails workflows, consistency often matters more than raw benchmark performance.

If you want to compare model capability, use a public, reproducible benchmark like SWE-bench Verified as one signal, and validate on your own repositories as the final arbiter.

Last updated: February 2026. Check model leaderboards and vendor documentation for the latest model names and packaging.

Hardware tiers

The hardware you need depends on the model you want to run and the context length you need. Here are three practical tiers, focused on what matters most for local inference: memory capacity.

  • Budget (single workstation GPU, ~24 GB VRAM). A modern consumer GPU with ~24 GB VRAM (for example, an RTX 4090-class card) can run small to mid-sized coding models at Q4 quantization. This tier is enough to get real value from overnight batch reviews on legacy codebases.
  • Mid (large unified memory or larger VRAM). Hardware with a larger memory pool (for example, a Mac Studio configuration with up to 128 GB unified memory, or a workstation GPU with more VRAM) unlocks larger models and longer contexts. For code review, this tier is often about fit (can the model load) rather than speed.
  • High (large memory / multi-GPU). A Mac Studio with 512 GB unified memory, or a multi-GPU server, gives you headroom to run larger open-weight models and experiment with higher precision. This tier is for teams that want the best possible local quality and have a compliance reason to keep inference inside their boundary.
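A rough way to turn a memory budget into a model-size ceiling uses the same rules of thumb as before: Q4 at ~0.5 bytes per parameter, minus ~20% headroom for KV cache and runtime buffers. Both figures are assumptions, not guarantees:

```shell
# Rule of thumb: largest Q4 model that fits a given memory budget.
max_q4_params_b() {
  budget_gb=$1
  usable_gb=$((budget_gb * 8 / 10))   # keep ~20% headroom
  echo $((usable_gb * 2))             # Q4 ~0.5 bytes/param -> ~2B params per GB
}
for gb in 24 128 512; do
  echo "~${gb} GB budget -> up to ~$(max_q4_params_b $gb)B params at Q4"
done
```

Real limits depend on context length and runtime, but this kind of estimate is usually enough to decide which tier a candidate model needs.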

Here is the critical insight for code review specifically: speed often doesn't matter. The primary use case for local AI code review is batch processing: start a review at 6pm, come back to a complete bug report at 9am. If you're not waiting on the result interactively, you can trade latency for data sovereignty without sacrificing the workflow.

This changes the hardware calculus. Instead of optimizing for tokens-per-second (which favors expensive multi-GPU setups), you optimize for memory capacity (which determines what models you can run at all).
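To see why capacity beats speed for batch review, estimate wall-clock time from token counts and throughput. All four numbers below are illustrative assumptions, not measurements:

```shell
# Illustrative wall-clock estimate for an overnight batch review.
INPUT_TOKENS=2000000    # whole-codebase prefill (assumption)
OUTPUT_TOKENS=100000    # generated findings (assumption)
PREFILL_TPS=800         # prefill tokens/s on local hardware (assumption)
GEN_TPS=25              # generation tokens/s (assumption)
PREFILL_MIN=$((INPUT_TOKENS / PREFILL_TPS / 60))
GEN_MIN=$((OUTPUT_TOKENS / GEN_TPS / 60))
echo "prefill: ~${PREFILL_MIN} min, generation: ~${GEN_MIN} min"
echo "total:   ~$(( (PREFILL_MIN + GEN_MIN) / 60 )) h (well inside a 6pm-9am window)"
```

Even at modest local throughput, the whole run fits in a fraction of an overnight window, which is why memory capacity, not tokens-per-second, is the constraint worth paying for.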

Step-by-step setup: Desktop

This walkthrough uses Ollama, which provides an Anthropic-compatible endpoint that Claude Code can talk to. The process takes under ten minutes if you already have a supported GPU.

# 1. Install Ollama (see ollama.com for platform-specific options)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull a coding model (example; model names change over time)
ollama pull qwen3-coder

# 3. Configure Claude Code to talk to Ollama's endpoint
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"

# 4. Smoke test the connection
claude -p "Reply with only: OK" --model qwen3-coder

Step 1 installs Ollama, which manages model downloads and serves a local API endpoint. On macOS, you can alternatively install via Homebrew (brew install ollama) or download the desktop app from ollama.com.

Step 2 downloads the model weights. This is a one-time operation: weights are cached locally and reused across runs.

Step 3 configures Claude Code to route requests to Ollama instead of a cloud endpoint. You can add these environment variables to your shell profile (~/.bashrc, ~/.zshrc) to make them persistent, or set them only when you want to run local reviews.

Step 4 confirms your local endpoint is working before you run a long review.

Run the review. Open VibeRails, select your project, and start a review. The key requirement is that the model ID VibeRails requests (via Claude Code) must be accepted by your local endpoint (or mapped by a compatibility proxy).

Alternative model servers. vLLM and llama.cpp can be used for local inference as well. Depending on the server you choose, you may need a compatibility layer that implements the Anthropic Messages API and maps model identifiers.

Step-by-step setup: Cloud GPU (AWS)

For organisations that need air-gap guarantees but cannot justify capital expenditure on dedicated hardware, cloud GPU instances offer a pay-per-use alternative. The key is configuring the cloud environment so that no data leaves your VPC – the AI inference happens on a GPU instance inside your network boundary, not on a third-party API endpoint.

Recommended instance types (pricing varies by region and date):

  • EC2 g6.xlarge (NVIDIA L4, 24 GB VRAM) – suitable for smaller models and aggressive quantization.
  • EC2 g5.xlarge (NVIDIA A10G, 24 GB VRAM) – another option in the same VRAM class.
  • EC2 P5 family (NVIDIA H100, 80 GB VRAM per GPU) – for higher-quality runs with 70B-class models (often quantized), when you need more VRAM and faster throughput.

Network-isolated VPC architecture. The goal is to create a cloud environment where GPU instances can run inference without any path to the public internet:

  • Private VPC with no NAT gateway. Create a VPC with private subnets only. Without a NAT gateway or internet gateway, instances in this VPC have zero outbound internet connectivity. No data can leave, even if the model server or application has a misconfiguration.
  • S3 VPC gateway endpoint for model weights. Attach a VPC gateway endpoint for S3 so instances can pull model weights from a private S3 bucket without internet access. Upload the model weights to S3 once from a connected machine, then all VPC instances can access them internally.
  • SSM endpoint for management. Use AWS Systems Manager (SSM) VPC endpoints for instance management instead of SSH over the internet. SSM sessions are logged and auditable, providing compliance-friendly access control.
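The isolation pattern above can be sketched with the AWS CLI. Treat this as an outline of the calls involved, not a runnable script: the region, CIDR ranges, and every resource ID are placeholders you would substitute with your own.

```shell
# Sketch: private VPC with no internet path, plus an S3 gateway endpoint.
# Region, CIDRs, and all IDs below are placeholders.
aws ec2 create-vpc --cidr-block 10.0.0.0/16
aws ec2 create-subnet --vpc-id vpc-PLACEHOLDER --cidr-block 10.0.1.0/24
# Deliberately: no internet gateway, no NAT gateway -> zero egress.

# Gateway endpoint so instances can read model weights from a private bucket
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-PLACEHOLDER \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-PLACEHOLDER

# Interface endpoints for SSM session access (no SSH, no public IPs)
for svc in ssm ssmmessages ec2messages; do
  aws ec2 create-vpc-endpoint \
    --vpc-id vpc-PLACEHOLDER \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-east-1.$svc \
    --subnet-ids subnet-PLACEHOLDER
done
```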

Pre-loaded EBS snapshot workflow. For maximum deployment speed, pre-bake model weights into an EBS snapshot. Create an EBS volume, download the model weights to it, take a snapshot, and attach cloned volumes to new instances at boot. This eliminates the need to download model weights at all – the instance boots with the model already on disk, ready to load into GPU memory in seconds.
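The snapshot workflow looks roughly like the following AWS CLI sequence. Again an outline, not a runnable script: sizes, the availability zone, and all IDs are placeholders.

```shell
# Sketch: bake model weights into an EBS snapshot once, clone per instance.
# Sizes, AZ, and IDs are placeholders.
aws ec2 create-volume --size 200 --availability-zone us-east-1a
# (attach the volume and copy model weights onto it from a connected machine)
aws ec2 create-snapshot --volume-id vol-PLACEHOLDER \
  --description "pre-baked model weights"
# New instances clone the snapshot at boot via a block-device mapping:
aws ec2 run-instances --image-id ami-PLACEHOLDER --instance-type g6.xlarge \
  --block-device-mappings \
  '[{"DeviceName":"/dev/sdf","Ebs":{"SnapshotId":"snap-PLACEHOLDER"}}]'
```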

Cost estimates for code review workloads (approximate):

  • Small codebase – minutes to tens of minutes depending on model size and batching.
  • Large codebase – tens of minutes to multiple hours depending on model size, context window, and how you chunk work.

The on-demand model works well for periodic reviews – spin up a GPU instance, run the review, download the results, and terminate the instance. You pay only for the compute time used. For teams that run reviews weekly or more frequently, reserved instances or savings plans can reduce costs by 30-60%.
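A back-of-envelope cost model for the on-demand pattern. The hourly rate, run length, and cadence are all placeholders; look up current pricing for your region and instance type:

```shell
# Illustrative on-demand cost for overnight reviews. All inputs are
# placeholder assumptions, not real pricing.
HOURLY_CENTS=250    # e.g. ~$2.50/h for a single-GPU instance (assumption)
REVIEW_HOURS=10     # one overnight run (assumption)
RUNS_PER_MONTH=4    # weekly cadence (assumption)
PER_RUN=$((HOURLY_CENTS * REVIEW_HOURS))
MONTHLY=$((PER_RUN * RUNS_PER_MONTH))
echo "per review: \$$((PER_RUN / 100)), per month: \$$((MONTHLY / 100))"
# A reserved instance or savings plan at ~40% off would scale MONTHLY down
# proportionally; rerun with your negotiated rate to compare.
```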

What you trade off

Local inference is a real trade-off, not a free upgrade. Understanding what you give up is essential for making an informed decision.

  • Speed. Expect local reviews to take materially longer than cloud API reviews, especially for large codebases where input (prefill) dominates. For VibeRails' overnight workflow, this is often an acceptable trade-off.
  • Vendor-specific features. Some cloud-only capabilities (like certain reasoning modes, response caching, or tool-call tuning) may not exist in your local runtime.
  • Operational work. Running local models means you own updates, model selection, and debugging. The simplest setup is still a server process on your network with GPU drivers, model weights, and monitoring.
  • Pure cost often favors the cloud. If data sovereignty is not a constraint, cloud APIs tend to be cheaper and faster on a per-review basis. Self-hosting is usually justified by compliance, data sovereignty, or security policy constraints rather than pure economics.

When self-hosting is justified. Local AI code review makes sense when you have an explicit requirement that source code cannot be processed by an external AI provider (air-gapped environments, export-controlled programs, or internal DLP policies), or when you already operate GPU infrastructure and want to keep inference inside your boundary.

The critical reframe: for overnight batch review of legacy codebases, speed is not the limiting factor. What matters is that processing stays within the boundary you define.

Compliance frameworks

Local AI code review is the simplest path to compliance for several frameworks that restrict where sensitive data can be processed:

  • ITAR (International Traffic in Arms Regulations) – defence exports regulations that prohibit sharing controlled technical data with foreign persons or transmitting it to servers outside the United States. Local inference on US-based hardware keeps all technical data within the required boundaries. See ITAR-Compliant AI Code Review for a detailed walkthrough.
  • CMMC 2.0 (Cybersecurity Maturity Model Certification) – mandatory for many defence contractors handling CUI, depending on contract clauses and assessment requirements. CMMC Level 2 requires that Controlled Unclassified Information (CUI) is processed only within the authorization boundary. Local model inference can simplify boundary definition by keeping code processing inside your infrastructure. See CMMC and AI Code Review for Defence Contractors for implementation guidance.
  • SOC 2 – requires documented controls over data processing and storage. When the AI model runs on infrastructure you control, the entire data flow is easier to scope and document. You may still have third-party dependencies elsewhere in your stack; the claim here is specifically about keeping inference inside your boundary.
  • GDPR (data sovereignty) – for EU-based organisations, local processing ensures that source code and analysis results remain within the required jurisdiction if your infrastructure is deployed accordingly. This reduces (but does not automatically eliminate) cross-border data transfer concerns; validate requirements with your legal and security teams.

In each case, the compliance argument is the same: by running inference locally, you eliminate the data transfer that triggers regulatory scrutiny. When your environment is configured with the right egress controls, the model runs on hardware you control and the results stay inside the boundary you define.

Get started

Download VibeRails, install Ollama, pull a model, and run your first local AI code review. The entire setup takes under ten minutes. The free tier includes 5 issues per review – enough to validate the workflow with your local model before committing. Pro plans start at $19/month, or $299 for a lifetime licence per developer.

Download Free | See Pricing