Run full-codebase AI code review on hardware you control. VibeRails orchestrates models through the Claude Code CLI, so the workflow stays the same whether inference runs on your desktop GPU, a Mac Studio, or a network-isolated cloud VPC.
Local AI code review means running an AI model on hardware you control – your workstation, a server in your rack, or a GPU instance in your own cloud VPC – instead of sending code to a third-party API endpoint over the internet. The model processes your source code, produces findings, and returns results without any data leaving your network boundary.
The core value proposition is straightforward: VibeRails combined with a local model lets you run AI-assisted code review without sending code to a third-party AI API. In practice, this works by routing Claude Code's API traffic to an Anthropic-compatible endpoint you control (for example, an Ollama server on localhost, or a model server inside a private VPC). From VibeRails' perspective, the review workflow is the same; where the code goes is determined by where the CLI sends requests and by your network egress controls.
The data flow for local AI code review through VibeRails has five steps, all of which happen within your network:
1. VibeRails spawns the claude CLI process with the review prompt and your project files, exactly as it would for a cloud API review.
2. The CLI reads the ANTHROPIC_BASE_URL environment variable – instead of sending requests to api.anthropic.com, it sends them to your local model server (for example, http://localhost:11434 for Ollama).
3. The local server runs inference and returns responses; your source code never crosses your network boundary.
4. The CLI streams results back as stream-json output. As long as you're using the same CLI, the output format VibeRails expects stays consistent even if the CLI is routing requests to a different endpoint.
5. VibeRails parses the stream and presents the findings.

A typical local setup is a small set of environment variables (exact requirements can vary by CLI version and server, but this works with Ollama's Anthropic-compatible endpoint):
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
ANTHROPIC_BASE_URL tells the CLI where to send requests. ANTHROPIC_AUTH_TOKEN
is required by Claude Code even when the endpoint is local; Ollama accepts any value here. If you're
using a proxy or a hosted Anthropic-compatible endpoint, this token may be real authentication.
For a current, working reference configuration (including model names), see Ollama's Claude Code guide.
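Before pointing Claude Code at the server, a quick sanity check can save a failed review run. This sketch assumes a default Ollama install listening on port 11434; /api/tags is Ollama's endpoint for listing the models it has pulled:

```shell
# List the models the local Ollama server has pulled; if the server
# isn't running, print a clear message instead of failing silently.
curl -s http://localhost:11434/api/tags || echo "Ollama is not responding on port 11434"
```

If this returns JSON listing your model, the endpoint is ready for the CLI to use.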
Model-name compatibility note. Claude Code passes a model identifier in each request.
Your local Anthropic-compatible server must accept that identifier (or map it to the local model you
want to run). In VibeRails, you can set a Custom Claude Model ID in
Settings to pass a local model name directly (for example qwen3-coder).
If your local server is strict about accepted model IDs, use a compatibility proxy (for example,
LiteLLM) to map incoming model IDs to your local model.
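A minimal sketch of that proxy setup, assuming LiteLLM's CLI (the package name, flags, port, and token value here are illustrative – check LiteLLM's documentation for the current interface):

```shell
# Start a LiteLLM proxy that fronts a local Ollama model (illustrative flags):
#   pip install 'litellm[proxy]'
#   litellm --model ollama/qwen3-coder --port 4000
# Then route Claude Code through the proxy instead of straight to Ollama:
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="local-proxy"
```

The proxy accepts whatever model ID the CLI sends and forwards the request to the local model you configured.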
The local model landscape changes rapidly. Rather than hard-coding benchmark scores (which change with evaluation scaffolds and new releases), here is a practical way to pick a starting model for code review:
If you want to compare model capability, use a public, reproducible benchmark like SWE-bench Verified as one signal, and validate on your own repositories as the final arbiter.
Last updated: February 2026. Check model leaderboards and vendor documentation for the latest model names and packaging.
The hardware you need depends on the model you want to run and the context length you need. Here are three practical tiers, focused on what matters most for local inference: memory capacity.
Here is the critical insight for code review specifically: speed often doesn't matter. The primary use case for local AI code review is batch processing: start a review at 6pm, come back to a complete bug report at 9am. If you're not waiting on the result interactively, you can trade latency for data sovereignty without sacrificing the workflow.
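The overnight pattern above can be as simple as a crontab entry. The path, prompt, and model name below are illustrative, and because cron runs with a minimal environment, the ANTHROPIC_* variables should be set in a wrapper script or in the crontab itself:

```shell
# crontab entry (one line): run a headless review at 18:00 on weekdays
# and append the findings to a log for the morning.
0 18 * * 1-5 cd /srv/legacy-app && claude -p "Review this codebase for bugs" --model qwen3-coder >> /var/log/nightly-review.log 2>&1
```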
This changes the hardware calculus. Instead of optimizing for tokens-per-second (which favors expensive multi-GPU setups), you optimize for memory capacity (which determines what models you can run at all).
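As a rough sizing rule (an approximation that ignores runtime overhead beyond a flat 20% allowance for KV cache and buffers), weight memory is parameter count times bytes per weight – about half a byte per parameter at 4-bit quantisation, two bytes at FP16:

```shell
# Rough memory needed to hold model weights, plus ~20% overhead.
estimate_gb() {
  # $1 = parameters in billions, $2 = bytes per weight (0.5 for Q4, 2 for FP16)
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b * 1.2 }'
}
estimate_gb 30 0.5   # ~18.0 GB for a 30B model at 4-bit
estimate_gb 70 2     # ~168.0 GB for a 70B model at FP16
```

This is why quantised models dominate local setups: the same 30B model that fits comfortably in a 24 GB GPU at 4-bit would need a multi-GPU rig at FP16.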
This walkthrough uses Ollama, which provides an Anthropic-compatible endpoint that Claude Code can talk to. The process takes under ten minutes if you already have a supported GPU.
# 1. Install Ollama (see ollama.com for platform-specific options)
curl -fsSL https://ollama.com/install.sh | sh
# 2. Pull a coding model (example; model names change over time)
ollama pull qwen3-coder
# 3. Configure Claude Code to talk to Ollama's endpoint
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
# 4. Smoke test the connection
claude -p "Reply with only: OK" --model qwen3-coder
Step 1 installs Ollama, which manages model downloads and serves a local API endpoint. On
macOS, you can alternatively install via Homebrew (brew install ollama) or download
the desktop app from ollama.com.
Step 2 downloads the model weights. This is a one-time operation: weights are cached locally and reused across runs.
Step 3 configures Claude Code to route requests to Ollama instead of a cloud endpoint. You can add these
environment variables to your shell profile (~/.bashrc, ~/.zshrc) to make them
persistent, or set them only when you want to run local reviews.
Step 4 confirms your local endpoint is working before you run a long review.
Run the review. Open VibeRails, select your project, and start a review. The key requirement is that the model ID VibeRails requests (via Claude Code) must be accepted by your local endpoint (or mapped by a compatibility proxy).
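Before kicking off a long review, it is worth confirming the model the CLI will request is actually present on the server. This sketch uses Ollama's `ollama list` command and a model name matching the earlier examples:

```shell
# Confirm the model Claude Code will request has been pulled locally;
# if not, pull it before starting the review.
MODEL="qwen3-coder"
if ollama list 2>/dev/null | grep -q "$MODEL"; then
  echo "$MODEL is ready"
else
  echo "$MODEL not found - run: ollama pull $MODEL"
fi
```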
Alternative model servers. vLLM and llama.cpp can be used for local inference as well. Depending on the server you choose, you may need a compatibility layer that implements the Anthropic Messages API and maps model identifiers.
For organisations that need air-gap guarantees but cannot justify capital expenditure on dedicated hardware, cloud GPU instances offer a pay-per-use alternative. The key is configuring the cloud environment so that no data leaves your VPC – the AI inference happens on a GPU instance inside your network boundary, not on a third-party API endpoint.
Recommended instance types (pricing varies by region and date):
Network-isolated VPC architecture. The goal is to create a cloud environment where GPU instances can run inference without any path to the public internet:
Pre-loaded EBS snapshot workflow. For maximum deployment speed, pre-bake model weights into an EBS snapshot. Create an EBS volume, download the model weights to it, take a snapshot, and attach cloned volumes to new instances at boot. This eliminates the need to download model weights at all – the instance boots with the model already on disk, ready to load into GPU memory in seconds.
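The snapshot workflow sketched above boils down to two AWS CLI calls per new instance. The IDs below are placeholders and the flags should be verified against the AWS CLI reference; the commands are printed as a dry run rather than executed:

```shell
# Dry-run sketch of cloning a pre-loaded snapshot onto a fresh GPU instance.
# All IDs are placeholders for illustration only.
print_snapshot_workflow() {
  SNAPSHOT_ID="snap-0123456789abcdef0"   # snapshot containing model weights
  AZ="us-east-1a"                        # must match the instance's AZ
  INSTANCE_ID="i-0123456789abcdef0"
  cat <<EOF
aws ec2 create-volume --snapshot-id $SNAPSHOT_ID --availability-zone $AZ --volume-type gp3
aws ec2 attach-volume --volume-id <new-volume-id> --instance-id $INSTANCE_ID --device /dev/sdf
EOF
}
print_snapshot_workflow
```

After attachment, mount the volume and point your model server at the weights directory on it.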
Cost estimates for code review workloads (approximate):
The on-demand model works well for periodic reviews – spin up a GPU instance, run the review, download the results, and terminate the instance. You pay only for the compute time used. For teams that run reviews weekly or more frequently, reserved instances or savings plans can reduce costs by 30-60%.
Local inference is a real trade-off, not a free upgrade. Understanding what you give up is essential for making an informed decision.
When self-hosting is justified. Local AI code review makes sense when you have an explicit requirement that source code cannot be processed by an external AI provider (air-gapped environments, export-controlled programs, or internal DLP policies), or when you already operate GPU infrastructure and want to keep inference inside your boundary.
The critical reframe: for overnight batch review of legacy codebases, speed is not the limiting factor. What matters is that processing stays within the boundary you define.
Local AI code review is the simplest path to compliance for several frameworks that restrict where sensitive data can be processed:
In each case, the compliance argument is the same: by running inference locally, you eliminate the data transfer that triggers regulatory scrutiny. When your environment is configured with the right egress controls, the model runs on hardware you control and the results stay inside the boundary you define.
Download VibeRails, install Ollama, pull a model, and run your first local AI code review. The entire setup takes under ten minutes. The free tier includes 5 issues per review – enough to validate the workflow with your local model before committing. Pro plans start at $19/month, or $299 for a lifetime licence per developer.
Tell us about your team and rollout goals. We will reply with a concrete launch plan.