Organisations with strict data handling requirements – ITAR, CMMC, data sovereignty mandates, or internal DLP policies – cannot send source code to external AI APIs. The standard advice is to buy GPU hardware and run models locally. But capital expenditure on a GPU workstation that only runs AI models a few times a month is a hard sell to any finance department. The hardware sits idle between reviews. It depreciates. It needs maintenance. And by the time the purchase order clears, the model landscape has moved on.
There is a middle path: rent a cloud GPU for the duration of a code review, run the model inside a private VPC that has no connectivity to the public internet, extract the findings, and shut the instance down. This is not a physical air gap, but it can provide strong network isolation when configured correctly. You pay only for compute time used.
This guide covers the specific infrastructure setup, GPU options and pricing, the air-gapped VPC architecture, how to get model weights into the instance without internet access, realistic cost-per-review estimates, and an honest comparison with the Anthropic API for teams that are not constrained by compliance.
The problem: need local processing, can't afford dedicated hardware
The constraint is a two-sided squeeze. On one side, compliance requires that source code stays within your controlled infrastructure. On the other side, the budget for a dedicated GPU workstation – typically $5,000 to $15,000 for a machine capable of running 70B parameter models – cannot be justified for a tool that runs intermittently.
A team that reviews its codebase monthly might use the GPU for 4 to 8 hours total per month. The rest of the time, the machine sits under a desk drawing idle power. For organisations that review quarterly, the utilisation is even lower. Procurement teams look at the cost-per-hour of actual use and the number is painful.
Cloud GPU rental eliminates this problem entirely. You spin up an instance when you need it, run the review, and terminate it when you are done. The per-hour cost is higher than the amortised cost of owned hardware at high utilisation, but the total monthly spend is a fraction of the capital outlay. For teams that run reviews weekly or less frequently, on-demand cloud GPU is almost always the more economical choice.
The critical requirement is that the cloud environment must be configured so that source code cannot reach the public internet. This is not the default configuration for any cloud provider. It requires deliberate VPC architecture – and getting it right is the difference between a compliant review environment and a data exfiltration risk.
AWS EC2 GPU options with pricing
AWS offers the broadest selection of GPU instance types for on-demand AI inference. The right choice depends on the model size you intend to run, which in turn depends on the quality level you need.
| Instance | GPU | VRAM | On-Demand $/hr | Spot $/hr | Best For |
|---|---|---|---|---|---|
| g6.xlarge | L4 | 24 GB | $0.81 | ~$0.36 | Budget: 7–13B models |
| g5.xlarge | A10G | 24 GB | $1.01 | ~$0.42 | Budget: 7–13B models |
| P5 single-GPU | H100 | 80 GB | $3.90 | ~$1.50 | Quality: 70B models |
The g6.xlarge is the workhorse for budget-conscious reviews. The L4 GPU handles quantised 13B models (like Devstral Small 2 or Qwen3-Coder-Next at Q4 precision) comfortably within its 24 GB VRAM. The g5.xlarge with its A10G GPU offers similar capacity at a slightly higher price point but wider availability in some regions. For teams that want the highest-quality findings from a 70B+ parameter model, the P5 with an H100 provides 80 GB of VRAM – enough for full-precision 70B models or quantised models up to around 140B parameters.
Spot pricing is worth considering for non-urgent reviews. Spot instances are excess AWS capacity offered at a discount, typically 50–65% off on-demand pricing. The tradeoff is that AWS can reclaim spot instances with a 2-minute warning. For a 30-minute code review, the risk of interruption is low, and recovery is cheap: if the instance is reclaimed mid-review, you restart the review on a new instance – the codebase and model weights are on persistent EBS storage, so nothing is lost except the in-progress analysis time.
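Requesting spot capacity is a launch-time option on the same `run-instances` call you would use for on-demand. A sketch – the AMI, subnet, and other IDs are placeholders for your own:

```bash
# MarketType=spot asks for spot capacity only; if none is available the
# request fails rather than silently falling back to on-demand pricing.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type g6.xlarge \
  --subnet-id subnet-0abc123 \
  --instance-market-options 'MarketType=spot,SpotOptions={SpotInstanceType=one-time,InstanceInterruptionBehavior=terminate}' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Purpose,Value=ai-code-review}]'
```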
Alternatives beyond AWS. Several cloud GPU providers offer competitive pricing and simpler billing models.
RunPod, Lambda Labs, and Vast.ai are common options for on-demand GPUs outside of AWS. The exact $/hr and billing granularity change frequently (and vary by region), so treat any specific numbers you see online as estimates and validate current pricing before building internal cost models.
For export-controlled or government workloads, you may need a regulated cloud environment (for example AWS GovCloud or Azure Government) and a network-isolated architecture. Whether a given cloud environment satisfies your program requirements depends on your contracts, boundary definition, access controls, and assessor expectations.
Network-isolated VPC setup (no public internet egress)
A default AWS VPC includes an internet gateway that gives instances a route to the public internet. For a network-isolated code review environment, you remove all routes to the public internet while preserving the ability to manage the instance and access required AWS services through VPC endpoints.
Create a private VPC with no internet gateway and no NAT gateway. This is the foundational requirement. Without an internet gateway, instances in the VPC have no route to the public internet. Without a NAT gateway, instances cannot initiate outbound connections to internet hosts. This materially reduces exfiltration paths, but you should still treat security groups, IAM, and endpoint policies as part of the boundary.
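Sketched with the AWS CLI, the foundation looks like this (CIDR ranges are illustrative; in practice you would also tag and name these resources):

```bash
# Create a VPC and one private subnet. Deliberately skip
# create-internet-gateway and create-nat-gateway: with neither attached,
# the route table has no path to the public internet.
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
  --query 'Vpc.VpcId' --output text)
SUBNET_ID=$(aws ec2 create-subnet --vpc-id "$VPC_ID" \
  --cidr-block 10.0.1.0/24 \
  --query 'Subnet.SubnetId' --output text)
```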
S3 VPC Gateway Endpoint. You still need a way to get model weights and tooling into the instance. An S3 VPC Gateway Endpoint provides private connectivity from your VPC to S3 without traversing the internet. Traffic between your instance and S3 stays entirely within the AWS network. Gateway endpoints are free – there are no per-hour charges and no per-GB data processing fees. You add the gateway endpoint to your VPC route table and your instance can access S3 buckets in the same region as if they were on the local network.
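As a sketch, with us-east-1 assumed as the region and placeholder IDs:

```bash
# Gateway endpoints are free; this adds a route for the S3 prefix list
# to the specified route table, keeping traffic on the AWS network.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc123
```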
SSM Interface Endpoints (PrivateLink). Without internet access, you cannot SSH into the instance through a public IP. AWS Systems Manager (SSM) Session Manager provides terminal access to instances without requiring inbound security group rules or a bastion host. But SSM itself needs connectivity to the SSM service endpoints. In a private VPC, this connectivity comes through Interface VPC Endpoints (powered by PrivateLink). You need three endpoints:
- com.amazonaws.region.ssm – the core Systems Manager endpoint for API calls and configuration.
- com.amazonaws.region.ssmmessages – handles the bidirectional communication channel for Session Manager sessions.
- com.amazonaws.region.ec2messages – enables the SSM agent on the instance to communicate with the SSM service.
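The three endpoints can be created in one loop. A sketch, with placeholder IDs and us-east-1 assumed as the region:

```bash
# The endpoint security group must allow inbound 443 from the VPC CIDR
# so the instance can reach the endpoint network interfaces.
for svc in ssm ssmmessages ec2messages; do
  aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0abc123 \
    --vpc-endpoint-type Interface \
    --service-name "com.amazonaws.us-east-1.${svc}" \
    --subnet-ids subnet-0abc123 \
    --security-group-ids sg-0abc123 \
    --private-dns-enabled
done
```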
Interface endpoints do have an hourly cost (approximately $0.01 per AZ per hour, plus $0.01 per GB of data processed), but the total is negligible for a review session – typically under $0.10 for a multi-hour session.
Security group configuration. The security group attached to your GPU instance should have no inbound rules (SSM Session Manager needs none; add SSH from a bastion host only if you prefer that access method). Outbound rules should allow HTTPS (port 443) only to the S3 gateway endpoint's managed prefix list and to the VPC's own CIDR range, where the SSM interface endpoints live – this permits communication with S3 and SSM while blocking all other outbound traffic. The prefix list is an AWS-managed list of IP ranges for the S3 gateway endpoint in your region; AWS maintains it automatically.
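A sketch of that configuration – the prefix-list ID pl-xxxx is a placeholder (look yours up with aws ec2 describe-prefix-lists), as are the VPC ID and CIDR:

```bash
# Create the group, strip the default allow-all egress rule, then allow
# HTTPS only to the S3 prefix list and the VPC's own CIDR (where the SSM
# interface endpoints live).
SG_ID=$(aws ec2 create-security-group \
  --group-name code-review-gpu \
  --description "Isolated code review instance" \
  --vpc-id vpc-0abc123 --query 'GroupId' --output text)
aws ec2 revoke-security-group-egress --group-id "$SG_ID" \
  --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'
aws ec2 authorize-security-group-egress --group-id "$SG_ID" \
  --ip-permissions 'IpProtocol=tcp,FromPort=443,ToPort=443,PrefixListIds=[{PrefixListId=pl-xxxx}]'
aws ec2 authorize-security-group-egress --group-id "$SG_ID" \
  --ip-permissions 'IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges=[{CidrIp=10.0.0.0/16}]'
```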
The result is an instance that can access S3 and be managed via SSM, but has no route to any internet-facing service. Source code uploaded to this instance cannot leave the VPC. Model weights downloaded from S3 travel only within the AWS network. The isolation is enforced at the network layer, not by policy or convention.
Pre-loaded EBS snapshot workflow
Getting model weights into an air-gapped instance is the most common point of friction. The recommended approach uses EBS snapshots to pre-stage model weights so they are available on disk immediately when the instance boots, with no internet access required.
Step 1: Download model weights on a temporary instance. Launch a small, inexpensive instance (a t3.medium is sufficient) in a standard VPC with internet access. Download the model weights to an attached EBS volume. For a 70B model at Q4 quantisation, the weights are approximately 35 GB. For a 13B model, approximately 7–8 GB.
```bash
# On the temporary instance with internet access
sudo mkfs.xfs /dev/xvdf
sudo mkdir /models
sudo mount /dev/xvdf /models

# Download model weights via Ollama
curl -fsSL https://ollama.com/install.sh | sh
OLLAMA_MODELS=/models/ollama ollama pull devstral-small-2
```
Step 2: Create an EBS snapshot. Once the model weights are on the EBS volume, unmount it and create a snapshot. This snapshot is stored in S3 (managed by AWS, not in your account's S3 buckets) and can be used to create new volumes in any AZ within the same region.
```bash
aws ec2 create-snapshot \
  --volume-id vol-0abc123def456 \
  --description "Devstral Small 2 model weights" \
  --tag-specifications 'ResourceType=snapshot,Tags=[{Key=Purpose,Value=ai-code-review}]'
```
Step 3: Reference the snapshot in your GPU instance launch template. In your launch template or CloudFormation stack, specify a secondary EBS volume created from this snapshot. Mount it at /models in the instance's user data script.
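A minimal launch-template sketch – the AMI ID, snapshot ID, and template name are placeholders:

```bash
# The secondary volume is created from the model snapshot and attached
# as /dev/xvdf, ready for the user data script to mount at /models.
aws ec2 create-launch-template \
  --launch-template-name gpu-code-review \
  --launch-template-data '{
    "ImageId": "ami-0123456789abcdef0",
    "InstanceType": "g6.xlarge",
    "BlockDeviceMappings": [{
      "DeviceName": "/dev/xvdf",
      "Ebs": {"SnapshotId": "snap-0abc123", "VolumeType": "gp3",
              "DeleteOnTermination": true}
    }]
  }'
```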
Step 4: Enable Fast Snapshot Restore (FSR). By default, EBS volumes created from snapshots use lazy loading – blocks are fetched from S3 on first access, which causes significant latency when loading a 35 GB model. FSR pre-initialises the volume so all data is available at full EBS performance from the moment the volume is attached. FSR costs approximately $0.75 per hour per snapshot per AZ while enabled. You can enable it just before launching your GPU instance and disable it afterward to minimise cost.
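Enabling and disabling FSR brackets the session. A sketch, with placeholder snapshot ID and AZ:

```bash
# Billing runs per hour per snapshot per AZ while FSR is enabled, so
# scope it to the single AZ you launch in and turn it off afterwards.
aws ec2 enable-fast-snapshot-restores \
  --availability-zones us-east-1a \
  --source-snapshot-ids snap-0abc123
# ...launch the GPU instance and run the review...
aws ec2 disable-fast-snapshot-restores \
  --availability-zones us-east-1a \
  --source-snapshot-ids snap-0abc123
```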
The result: when your GPU instance boots in the private VPC, the model weights are already on disk at /models. No internet access is needed. No S3 download at boot time. The model server (Ollama, vLLM, or similar) starts and loads the weights from local disk. Total boot-to-ready time is typically under 3 minutes with FSR enabled.
Alternative: S3 VPC endpoint download at boot. If you prefer not to manage EBS snapshots, you can upload model weights to an S3 bucket and download them via the S3 VPC Gateway Endpoint when the instance boots. Same-region S3 to EC2 data transfer is free. The tradeoff is boot time – downloading 35 GB from S3 at typical S3 throughput takes 3–5 minutes, compared to near-instant availability with a pre-loaded EBS snapshot. For smaller models (7–13B at 4–8 GB), the S3 approach adds only about 30 seconds and avoids the FSR cost entirely.
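A user-data fragment for the S3 variant might look like this – the bucket and key names are placeholders, and the copy rides the gateway endpoint, so no internet route is needed:

```bash
# Pull model weights from a same-region bucket through the S3 gateway
# endpoint at boot; transfer is free and stays on the AWS network.
mkdir -p /models/ollama
aws s3 cp s3://example-model-weights/devstral-small-2/ /models/ollama/ \
  --recursive --no-progress
```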
The user data script for either approach configures the environment so that VibeRails (via Claude Code CLI) routes all API calls to the local model server:
```bash
#!/bin/bash
# Mount the pre-loaded model volume
mkdir -p /models
mount /dev/xvdf /models

# Start Ollama with models directory
export OLLAMA_MODELS=/models/ollama
systemctl start ollama
# Model is already available from EBS snapshot
# Ollama serves on localhost:11434

# Configure Claude Code CLI for local inference
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_API_KEY="ollama"
export ANTHROPIC_AUTH_TOKEN="ollama"
```
Cost estimate per review session (how to think about it)
The total cost of a cloud GPU review is mostly compute time. Storage and VPC endpoint costs exist, but for short-lived sessions they are typically secondary.
A practical way to estimate cost is:
- Pick an instance type that can load your target model (VRAM is usually the constraint).
- Run a small calibration review (a representative directory or file group) and measure runtime.
- Multiply runtime by the provider's current hourly rate for that instance type.
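The arithmetic is simple enough to script. A sketch with illustrative numbers – the per-10k-lines figure comes from your own calibration run, and the rate is whatever your provider currently charges:

```bash
minutes_per_10k=6      # measured: minutes to review 10k lines in calibration
repo_kloc=120          # size of the full codebase, in thousands of lines
rate_per_hour=0.81     # current hourly rate for your chosen instance type

# Minutes for the whole repo, converted to hours, times the hourly rate.
cost=$(awk -v m="$minutes_per_10k" -v k="$repo_kloc" -v r="$rate_per_hour" \
  'BEGIN { printf "%.2f", (m * k / 10) / 60 * r }')
echo "estimated cost: \$${cost}"   # → estimated cost: $0.97
```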
For periodic audits, the operational advantage is that you can spin up a GPU instance, run the review, extract the report, and terminate the instance. There is no idle infrastructure cost between runs.
Comparison: self-hosted cloud GPU vs cloud API
The decision is usually not about saving money. It is about whether your compliance posture permits sending source code to an external AI API.
- Cloud API: simplest operationally and often faster. Best when your security and compliance teams allow it.
- Cloud GPU (self-hosted): more operational work (model server, weights, monitoring), but you can keep inference inside a network-isolated environment you control.
A self-hosted 70B-class model will not always match frontier cloud models on nuanced architectural reasoning. But for legacy codebases that have never had a systematic audit, even smaller models can surface real issues: obvious security mistakes, error-handling gaps, dead code, inconsistent patterns, and concrete bugs.
When cloud GPU makes sense vs when API is better
The decision framework is straightforward. The determining factor is not cost or quality – it is whether your organisation's compliance posture permits sending source code to an external API.
Use the Anthropic API if: your organisation has no compliance restrictions on cloud AI processing, you want the highest quality findings, you want the lowest cost per review, and you want the fastest turnaround. For most commercial software teams, this is the right choice. VibeRails supports the Anthropic API natively through Claude Code CLI, and the BYOK model means you use your existing Claude Code subscription with no additional AI costs from VibeRails.
Use cloud GPU if: compliance requires that source code stay inside infrastructure you control, but you cannot justify capital expenditure on dedicated hardware. This includes teams handling ITAR-controlled technical data, organisations processing CUI under CMMC requirements, teams subject to data sovereignty laws that prohibit sending code to US-based cloud AI providers, and any organisation whose security policy explicitly prohibits external AI APIs for source code analysis. Cloud GPU gives you strong network isolation with operational expenditure instead of capital expenditure.
Use desktop hardware if: you have high review volume (daily or more frequent reviews), you want zero recurring compute costs after the initial hardware purchase, you operate in a SCIF or other physically disconnected environment where cloud access is impossible, or you need the lowest possible latency for interactive review workflows. A dedicated workstation GPU can amortize to a low cost-per-review over many sessions, and the hardware is always available without boot time. See our hardware guide for a practical decision matrix.
A hybrid approach is common. Many organisations use the Anthropic API for non-sensitive codebases and self-hosted infrastructure for classified or restricted work. VibeRails supports both modes through the same interface – the only difference is which environment variables are set. A developer can switch from cloud API to local model by changing ANTHROPIC_BASE_URL and restarting the review. No reconfiguration of VibeRails itself is required.
Putting it together: your first cloud GPU review
The complete workflow from zero to a finished code review on a cloud GPU takes approximately 30–45 minutes of setup time the first time, and under 10 minutes for subsequent sessions once your launch template and EBS snapshot are in place.
First-time setup: create the private VPC with S3 and SSM endpoints, download model weights to an EBS volume on a temporary instance, snapshot the volume, and create a launch template that references the snapshot. This is one-time infrastructure work. Once the VPC and snapshot exist, they are reusable indefinitely.
Each review session: launch the GPU instance from the template, connect via SSM, upload your codebase (or pull it from a private Git repository accessible within your VPC), run the review through VibeRails and Claude Code CLI, extract the findings, and terminate the instance. The compute meter runs only while the instance is live.
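Scripted end to end, a per-session wrapper might look like this (the template name is a placeholder, and the session assumes the SSM interface endpoints are in place):

```bash
# Launch from the template and wait for the instance to come up.
INSTANCE_ID=$(aws ec2 run-instances \
  --launch-template LaunchTemplateName=gpu-code-review \
  --query 'Instances[0].InstanceId' --output text)
aws ec2 wait instance-running --instance-ids "$INSTANCE_ID"

# Interactive shell into the private instance (no public IP, no SSH):
aws ssm start-session --target "$INSTANCE_ID"

# ...run the review, copy findings out through the S3 endpoint...
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
```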
For teams that want to automate this further, the entire workflow can be scripted with the AWS CLI or wrapped in a CloudFormation stack that creates and destroys the review environment on demand. Some teams integrate this into their CI/CD pipeline – a scheduled job launches the GPU instance, runs the review overnight, stores the findings in S3, and terminates the instance before the team arrives in the morning. Because the meter runs only while the instance is live, an unattended review that takes about an hour on a spot H100 typically costs under $2.00.
Whether you use the Anthropic API, a cloud GPU, or desktop hardware, VibeRails provides the same review experience and the same structured findings. The model backend is a configuration detail, not a product limitation. For a detailed walkthrough of the local model setup – including Ollama configuration, model selection, and performance expectations – see our complete guide to local AI code review. To get started now, download VibeRails and follow the setup for your preferred infrastructure.
Limits and tradeoffs
- It can miss context. Treat findings as prompts for investigation, not verdicts.
- False positives happen. Plan a quick triage pass before you schedule work.
- Privacy depends on your model setup. If you use a cloud model, relevant code is sent to that provider; local models can keep inference on your own hardware.