
How to deploy open-source LLMs cheaply in 2026

API, dedicated endpoint, GPU rental, or self-host — the cheapest shape depends on volume, model size, utilization, and how much operational complexity you can carry. Decision tree, break-even math, and three worked examples with real 2026 prices from the nfer pricing index.

Last updated · 2026-05-12

Why this guide exists

Inference is the line item that keeps growing. Frontier models keep getting more capable and more expensive per token, workloads pointed at them keep growing in volume, and a $0.20/1M output-token difference is about $600 a month at 100M tokens/day, reaching thousands of dollars a month at several hundred million tokens/day. The teams that win on unit economics in 2026 don't point a frontier model at every workload. They pick the smallest model that's good enough for each job, route only the hard cases to expensive frontier APIs, and run the rest on whatever hosting shape is cheapest for that specific model. Using GPT-4-class or Claude Opus-class inference for a classification pipeline is a helicopter trip to the supermarket: fast, but the wrong tool.

Once you've chosen which model fits this workload, the next lever is where to run it. This guide is about that second decision, for open-source models. We walk through the four hosting shapes — API, dedicated endpoint, on-demand GPU rent, self-host — show how the cheapest one shifts as volume, model size, utilization, and operational tolerance move, and work three scenarios end-to-end: Llama-3.3-70B-Instruct and Mistral-7B-Instruct-v0.3 on rented GPUs, plus a bought-hardware self-host. The two model sizes are intentional — a 70B and a 7B behave very differently in the rent-vs-buy decision.

TL;DR

The cheapest open-source LLM deployment is a function of four variables: tokens per day, model size and quantization, sustained utilization, and the operational complexity you can carry. Below ~1.5M Llama-3.3-70B tokens per day, API wins. Between 1.5M and 3M per day, a managed dedicated endpoint usually wins. Above 3M per day at >50% utilization, renting a GPU by the hour beats API by 30-60%. Buying hardware pays back in 14-20 months for large models against rental and in 6-12 months for small models against API — if the workload is mature, stable, and predictable. The cheapest provider within each shape is a second decision worth another 2-3×; use nfer's host pages for the live ranking. Break-even curves shift materially with the model — what's true for a 70B isn't true for a 7B.

The four deployment shapes

No one-shape-fits-all answer exists for open-source inference. Broadly, we consider four shapes; each has a different cost structure and operational footprint, and the right one depends on the workload.

| Shape | Cost structure | Ops cost | Cheapest when |
|---|---|---|---|
| API (Nebius, Together AI, Fireworks, Groq, DeepInfra…) | $ per 1M tokens (input and output priced separately) | Near-zero | Bursty workloads; volumes below the GPU-rent crossover |
| Dedicated endpoint (Together AI Dedicated, Fireworks Dedicated, Replicate Cog…) | $ per hour reserved, billed continuously | Low: provider handles autoscaling | Steady mid-volume; predictable load with bursts |
| On-demand GPU rent (Lambda, RunPod, Verda, CoreWeave, OVHcloud…) | $ per hour on-demand; bring your own image (vLLM, TGI, sglang, llama.cpp) | Medium: you own the inference server, monitoring, retries | Sustained high volume at >50% utilization |
| Self-host (your hardware in your DC or a colo rack) | Capex + power + facility + ops, amortized over the asset's life | High: full stack, hardware refresh included | Mature, stable workload with predictable 24/7 throughput; multi-year horizon; regulated data |

Two clarifications because they're commonly miscounted:

API ops cost isn't zero, it's near-zero. You still write retry logic, handle per-provider rate limits, monitor latency, and pick between providers when one degrades — but you don't run servers. Treat it as <1 engineering hour per week.

Dedicated endpoints sit between API and GPU rent, not above them. Reserved-hour pricing beats per-token once you exceed ~12-16 hours per day of equivalent API tokens, but you eat the cost on slow days because billing is continuous.
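The retry and provider-failover work mentioned above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not any provider's SDK: `call_with_fallback`, the provider callables, and the backoff values are all hypothetical.

```python
import time
from typing import Callable, Sequence


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    retries_per_provider: int = 2,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, backing off between failed attempts.

    Each provider is a hypothetical callable that takes a prompt and
    returns text, raising an exception on rate limits or outages.
    """
    last_error = None
    for provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return provider(prompt)
            except Exception as exc:  # real code would catch narrower errors
                last_error = exc
                sleep(base_delay_s * 2 ** attempt)  # 0.5s, then 1s, ...
    raise RuntimeError("all providers failed") from last_error
```

Ordering `providers` from cheapest to most expensive keeps the fallback path from silently raising the bill on normal days.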

Decision tree

Use these rules to pick the shape before comparing providers. Thresholds anchor to Llama-3.3-70B-Instruct at Q4 (~40GB VRAM) because it's the most-asked open 70B in the nfer index as of 2026-05-12. The same structure applies to other models with shifted thresholds — see the worked examples.

  1. <1.5M tokens/day? API. At Nebius's Llama-3.3-70B-Instruct rate of $0.40/1M output (cheapest in the nfer index as of 2026-05-12), 1.5M tokens/day is ~$18/mo. No GPU-rent option beats that once you factor in idle hours. Recommended starting point for >90% of new workloads. The cheapest provider within API can differ 2-3× for the same model; the live ranking is on nfer's Llama-3.3-70B-Instruct host page.
  2. 1.5M-3M tokens/day, predictable load? Dedicated endpoint. A single-GPU endpoint at ~$2.50-3.50/hr reserved runs ~$1,800-2,500/mo, beating API around 2.5M+ tokens/day. Pick this when you'd rather not run a GPU yourself.
  3. 3M+ tokens/day at >50% utilization? On-demand GPU rent. A single H100 80GB SXM on Verda at $2.29/hr is ~$1,650/mo at 100% uptime; Nebius H100 SXM at $2.95/hr is ~$2,125/mo. At Q4 you serve 70B-class workloads; at fp8 you can host two Mistral-7B copies. Below 50% utilization, API still wins.
  4. Mature, stable 24/7 workload for 18+ months, OR compliance / data-residency forces it? Self-host. For 70B-class, a new H100 80GB PCIe is ~$25-35k plus ~$300-800/mo for power + colo; crossover vs renting the same H100 sits at 18-24 months. For 7B-class workloads the math is much faster: a used consumer GPU at $700-1,800 can pay back against API in 6-12 months. See worked example 3.

The most common mistake is jumping to GPU rent on volumes that don't justify it. Below 1.5M tokens/day on a 70B-class model, a rented H100 sits idle ~80% of the time. API wins by 2-4× in that regime.
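The four rules above can be written down as a small function. This is one possible reading of the tree, not a definitive implementation: the function name, parameter names, and the choice to treat compliance as an override are assumptions, and the thresholds are the guide's Llama-3.3-70B-Instruct Q4 anchors.

```python
def pick_shape(
    tokens_per_day: float,
    utilization: float,
    predictable_load: bool = False,
    stable_months: int = 0,
    compliance_required: bool = False,
) -> str:
    """Suggest a hosting shape from the decision tree (70B Q4 thresholds)."""
    # Rule 4 (override): compliance or data residency forces self-host.
    if compliance_required:
        return "self-host"
    # Rule 1: low volume stays on API.
    if tokens_per_day < 1_500_000:
        return "api"
    high_volume = tokens_per_day >= 3_000_000 and utilization > 0.5
    # Rule 4: mature, stable, highly utilized workloads can justify buying.
    if high_volume and stable_months >= 18:
        return "self-host"
    # Rule 3: sustained high volume on rented GPUs.
    if high_volume:
        return "on-demand GPU rent"
    # Rule 2: predictable mid-volume load.
    if tokens_per_day < 3_000_000 and predictable_load:
        return "dedicated endpoint"
    # Everything else, including <50% utilization, falls back to API.
    return "api"


print(pick_shape(5_000_000, utilization=0.6))
```

For other models, shift the two volume thresholds before reusing the function.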

Break-even math

For any open-source model the API ↔ GPU-rent crossover is:

break-even tokens/day
  = (GPU $/hr × 24)
  ÷ (blended API $ per 1M tokens)
  × 1,000,000

where the blended rate weights the input and output prices by the workload's output:input token ratio.

Worked example — Llama-3.3-70B-Instruct, Q4 on a single rented H100 SXM at Verda's $2.29/hr (cheapest H100 80GB in the nfer index as of 2026-05-12):

  • GPU rent: $2.29/hr × 24 = $54.96/day
  • API rate: Nebius Llama-3.3-70B-Instruct at $0.13/1M input and $0.40/1M output (cheapest provider for this model)
  • Blended I/O ratio 3:1 (output:input); blended rate = (3 × $0.40 + 1 × $0.13) ÷ 4 ≈ $0.33/1M
  • Break-even: 54.96 ÷ 0.33 × 1,000,000 ≈ 166M blended tokens/day at 100% sustained throughput

A real H100 serving 70B-Q4 with vLLM doesn't sustain 100%. Plan for 30-50% sustained throughput after batching limits, prompt-cache warmup, and traffic shaping. Adjusted at 40%: ~65M blended tokens/day before GPU rent beats API at Nebius rates. Against Together AI ($0.88 blended), the same H100 crosses around 25M tokens/day at 40% — which is why provider choice matters as much as shape choice.
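The break-even arithmetic above is easy to rerun for other prices. The sketch below assumes the same inputs as the worked example (Verda H100 at $2.29/hr, Nebius at $0.13/$0.40 per 1M tokens, a 3:1 output:input mix); the function names are illustrative only.

```python
def blended_rate(input_price: float, output_price: float,
                 output_to_input: float = 3.0) -> float:
    """Blended $ per 1M tokens for a given output:input token ratio."""
    return (output_to_input * output_price + input_price) / (output_to_input + 1)


def break_even_tokens_per_day(gpu_per_hour: float, api_per_million: float) -> float:
    """Daily tokens at which API spend equals 24 hours of GPU rent."""
    return gpu_per_hour * 24 / api_per_million * 1_000_000


rate = blended_rate(0.13, 0.40)               # Nebius prices from the text
full = break_even_tokens_per_day(2.29, rate)  # Verda H100, 100% throughput
at_40 = full * 0.40                           # the guide's 40% adjustment

print(f"blended ${rate:.4f}/1M, {full / 1e6:.0f}M/day at 100%, "
      f"{at_40 / 1e6:.0f}M/day at 40%")
```

Using the unrounded blended rate shifts the results by about 1% relative to the rounded figures in the text, which does not change the conclusion.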

The crossover shifts with model size. A 7B on a $1.17/hr L4 crosses around 3-6B tokens/month: the GPU is cheaper, but so is the API (~$0.11/1M at OVHcloud for Mistral-7B). Always recompute against the actual model.

For a full calculator with your assumptions, use nfer's host-model page — same math, current prices.

Worked example 1: Llama-3.3-70B on a rented A100

Cost case for a 70B-class open model on a rented A100. The deployment recipe (vLLM image, container flags, gateway, monitoring) is its own discipline and isn't covered here.

What you're buying:

  • 1× A100 80GB rented hourly — fits Llama-3.3-70B at Q4 (~40GB) with headroom for KV-cache and concurrent batches.
  • Representative rates (2026-05-12): Verda A100 SXM at $1.29/hr (cheapest A100 80GB in the nfer index); Lambda Labs A100 at $1.99/hr; CoreWeave A100 80GB PCIe at $2.21/hr. Spread across the nfer index for on-demand A100 80GB is $1.29-$3.59/hr depending on tier.

Realistic monthly cost:

  • Verda A100 SXM at $1.29/hr × 24 × 30 = $929/mo at 100% uptime.
  • At 50% useful utilization, the effective rate is ~$2.58 per useful-GPU-hour ($1.29 ÷ 0.5), still cheaper than Together AI Dedicated for 70B-Q4 in the same index.
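The two figures in this list come from the same hourly price. A minimal sketch, assuming a 30-day month as the guide does; the helper names are hypothetical.

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month, as in the figures above


def monthly_rent(gpu_per_hour: float, uptime: float = 1.0) -> float:
    """Monthly bill for a GPU billed hourly; uptime is the billed fraction."""
    return gpu_per_hour * HOURS_PER_MONTH * uptime


def cost_per_useful_hour(gpu_per_hour: float, utilization: float) -> float:
    """Hourly price divided by the fraction of time doing useful work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_per_hour / utilization


print(round(monthly_rent(1.29)))                  # Verda A100 SXM, 100% uptime
print(round(cost_per_useful_hour(1.29, 0.5), 2))  # same GPU at 50% utilization
```

The effective rate is the number to compare against per-hour dedicated pricing, because idle hours are still billed.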

When this beats API:

Compared to Together AI's Llama-3.3-70B API at $0.88/1M output, $929/mo of GPU buys ~1.05B output tokens of equivalent API spend. The A100 sustains roughly 1-2B output tokens/month at vLLM's typical 70B-Q4 throughput. Adding ~$200/mo of monitoring + on-call attention moves break-even to ~1.3B output tokens/month ($1,129 ÷ $0.88/1M); at the 2B top of that range the same tokens cost ~$1,760 on the API, so the box saves ~35%. Compared to Nebius's Llama-3.3-70B at $0.40/1M output, the crossover more than doubles: the cheaper API pushes rent-beats-API to ~2.3B output tokens/month, above what a single A100 is likely to sustain.
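The comparison above reduces to dividing the monthly GPU bill (plus any fixed overhead) by the API price. The sketch below uses the $929/mo rent and the $0.88 and $0.40 output prices from the text; the function names and the idea of folding the ~$200/mo ops cost into the numerator are assumptions made for illustration.

```python
def crossover_tokens_per_month(gpu_monthly: float, api_per_million: float,
                               overhead_monthly: float = 0.0) -> float:
    """Output tokens/month at which API spend equals GPU rent plus overhead."""
    return (gpu_monthly + overhead_monthly) / api_per_million * 1_000_000


def saving_vs_api(tokens_per_month: float, gpu_monthly: float,
                  api_per_million: float, overhead_monthly: float = 0.0) -> float:
    """Fraction saved by renting instead of paying the API for the same tokens."""
    api_cost = tokens_per_month / 1_000_000 * api_per_million
    return 1 - (gpu_monthly + overhead_monthly) / api_cost


for name, price in [("Together AI", 0.88), ("Nebius", 0.40)]:
    tokens = crossover_tokens_per_month(929, price)
    print(f"{name}: {tokens / 1e9:.2f}B output tokens/month")
```

A negative result from `saving_vs_api` means the API is still cheaper at that volume.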

When it doesn't:

  • Below ~1B output tokens/month, the A100 idles enough that API is cheaper.
  • Bursty workloads lose money to idle on a rented GPU.
  • No on-call rotation → operational risk eats the savings.

Worked example 2: Mistral-7B on a rented L4

Single-GPU deployment of a sub-10B open model. Compared to example 1, the rent-beats-API threshold is higher because the API price falls much further than the GPU price: roughly 8× ($0.88 to $0.11 per 1M tokens) against about 1.3× ($1.29 to $0.98 per hour). The central insight is that small-model deployments are harder to justify on rented hardware unless volume is large.

Mistral-7B-Instruct-v0.3 is genuinely open: Apache 2.0, weights on HuggingFace at mistralai/Mistral-7B-Instruct-v0.3. Mistral's commercial models — Large, Medium, Small 3 — are API-only and proprietary; don't confuse them. Drop-in alternatives on the same footprint: Qwen2.5-7B-Instruct (Apache 2.0) and Llama-3.1-8B-Instruct (Llama Community License).

What you're buying:

  • 1× NVIDIA L4 24GB rented hourly — fits Mistral-7B-Instruct at fp16 (~14GB).
  • Rates (2026-05-12): AWS L4 at $0.98/hr (cheapest L4 in the nfer index); OVHcloud L4 at $1.17/hr; Scaleway L4 also in the index.

Realistic monthly cost:

  • AWS L4 at $0.98/hr × 24 × 30 = $705/mo at 100% uptime; OVHcloud L4 ~$842/mo.
  • At 50% utilization, throughput sits around 250M-500M tokens/month.

When this beats API:

OVHcloud's Mistral-7B-Instruct API at $0.11/1M is the cheapest in the nfer index (2026-05-12). At that rate, $705/mo of API spend buys ~6.4B tokens. The L4 only beats API for Mistral-7B above ~3-6B tokens/month sustained — substantially more volume than the 70B equivalent. Below that, API wins, and that covers nearly every production workload for a 7B-class model.

Side by side
  • For a 70B model, rent-on-A100 beats Together AI's API above ~1B output tokens/month.
  • For a 7B model, rent-on-L4 needs ~3-6B tokens/month — more tokens, not fewer, despite the cheaper GPU.
  • That counter-intuitive result is why model size belongs in the break-even equation alongside volume.
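The side-by-side result can be reproduced from the prices quoted in the two examples. This is a rough sketch at 100% uptime with no ops overhead; the `Scenario` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    gpu_per_hour: float      # on-demand rental price, $/hr
    api_per_million: float   # API price, $ per 1M tokens

    @property
    def monthly_rent(self) -> float:
        return self.gpu_per_hour * 24 * 30

    @property
    def crossover_tokens_per_month(self) -> float:
        """Tokens/month where API spend equals the monthly rent."""
        return self.monthly_rent / self.api_per_million * 1_000_000


scenarios = [
    Scenario("Llama-3.3-70B, Verda A100 vs Together AI", 1.29, 0.88),
    Scenario("Mistral-7B, AWS L4 vs OVHcloud", 0.98, 0.11),
]
for s in scenarios:
    print(f"{s.name}: {s.crossover_tokens_per_month / 1e9:.2f}B tokens/month")
```

The 7B scenario needs roughly six times the volume because its API price is about eight times lower while the GPU is only about 25% cheaper.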

Worked example 3: Self-host on bought hardware

Buy the GPU, rack it, pay power + facility, amortize capex. Self-host is rarely cheapest at low volume — it's cheapest at predictable high utilization over multi-year horizons. The payback math differs materially between large-model and small-model workloads.

Scenario 3a — Small model, consumer hardware

Mistral-7B-Instruct on a used RTX 3090. Cheapest defensible self-host for a 7B-class workload running 24/7.

  • GPU: Used RTX 3090 24GB on the secondary market at $600-800 (2026-05-12 eBay listings). Fits Mistral-7B-Instruct at fp16 with headroom; at Q4 you can run two copies side-by-side.
  • Host chassis: $1,200-1,800 new, or use an existing dev box.
  • Power: ~350W × 24 × 30 × $0.15/kWh ≈ $38/mo.
  • Facility: $0/mo at home/office; $100-150/mo for a single-U colo with ~500W power budget.

Amortized monthly:

  • Capex: $1,800-2,600 / 36 months = $50-72/mo.
  • Opex: $38-188/mo for power + (optional) colo.
  • Total: ~$90-260/mo for sustained 24/7 Mistral-7B serving.
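The amortized totals above combine a capex term spread over 36 months with monthly power and facility costs. A minimal sketch using the scenario 3a inputs; the helper names are illustrative, and the 36-month horizon is the one used in the list.

```python
def power_cost_monthly(watts: float, price_per_kwh: float,
                       hours: int = 24 * 30) -> float:
    """Electricity cost for a constant draw over a 30-day month."""
    return watts / 1000 * hours * price_per_kwh


def amortized_monthly(capex: float, opex_monthly: float, months: int = 36) -> float:
    """Straight-line capex amortization plus recurring monthly costs."""
    return capex / months + opex_monthly


power = power_cost_monthly(350, 0.15)                # RTX 3090 under load
low = amortized_monthly(600 + 1_200, power)          # cheapest parts, no colo
high = amortized_monthly(800 + 1_800, power + 150)   # priciest parts plus colo

print(f"power ${power:.0f}/mo, total ${low:.0f}-{high:.0f}/mo")
```

Shortening `months` to match a faster hardware refresh cycle raises the capex term proportionally.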

Payback:

  • vs. AWS L4 rental at $705/mo: self-host saves $445-615/mo. Capex pays back in 3-6 months at the high-savings end.
  • vs. OVHcloud Mistral-7B API at $0.11/1M: at ~1.3B tokens/month sustained, API runs ~$143/mo. Below that, API is still cheaper.

This is where the "owning hardware only pays back over multi-year horizons" claim breaks. For a small model running 24/7 against rental, breakeven is under six months — not multiple years.
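The payback claim for scenario 3a can be checked by dividing capex by the monthly saving over the alternative. The sketch below assumes the AWS L4 rental figure of $705/mo and the opex range from the list above; `payback_months` is a hypothetical helper.

```python
import math


def payback_months(capex: float, alternative_monthly: float,
                   opex_monthly: float) -> float:
    """Months until capex is recovered from savings over the alternative."""
    monthly_saving = alternative_monthly - opex_monthly
    if monthly_saving <= 0:
        return math.inf  # owning never catches up
    return capex / monthly_saving


best = payback_months(1_800, 705, 38)    # cheap parts, power only
worst = payback_months(2_600, 705, 188)  # pricier parts, power plus colo

print(f"payback vs L4 rental: {best:.1f} to {worst:.1f} months")
```

Run against the ~$143/mo API bill with colo costs included, the same function returns infinity, which is the "API is still cheaper" case.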

Scenario 3b — Large model, datacenter hardware

Llama-3.3-70B-Instruct on a new H100 or used A100. Upfront cost is large enough that payback math is genuinely a 1-2 year commitment.

  • GPU: New H100 80GB PCIe ~$25-35k (distributor quote 2026-05-12); used A100 80GB SXM ~$15-20k on the secondary market.
  • Host server: 1U/2U with PCIe + cooling, $5-10k.
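  • Power: H100 ~350-700W under load, or ~$30-60/mo at $0.12/kWh for the GPU alone; budget ~$60-130/mo for the whole server once host and cooling draw are included (an assumption of roughly 2× the GPU's draw).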
  • Colo: 1U with 500-800W power budget: $300-800/mo.

Amortized monthly:

  • Capex (new H100): $30-45k / 36 mo = $830-1,250/mo.
  • Capex (used A100): $20-30k / 36 mo = $560-830/mo.
  • Opex: $360-930/mo.
  • Total (new H100): $1,190-2,180/mo. Total (used A100): $920-1,760/mo.

Payback:

  • vs. renting the same H100 24/7 at Verda's $2.29/hr ≈ $1,650/mo: a new H100 self-host reaches parity around month 18-24. A used A100 self-host beats H100 rental from month one, with payback in 12-18 months.
  • vs. Together AI Llama-3.3-70B API at $0.88/1M output: needs ≥2-3B output tokens/month for self-host to stack up. vs. Nebius at $0.40/1M output, you'd need ≥4-6B tokens/month — cheaper API pushes self-host's payback further out.

The honest summary on 70B self-host: it's a control decision more than a cost decision. Cost savings vs renting the same GPU are real but modest (10-30%) and need 18+ months of stable utilization to materialize. You're buying data residency, deep customization, predictable latency, and no provider concentration risk.

When self-host is genuinely wrong

Even with favourable small-model math, skip self-host when:

  • Workload size is still changing weekly — you'll buy the wrong GPU.
  • Team <3 people — no on-call rotation means downtime at the worst time.
  • Multi-region failover needed from day one.
  • Model ships major refreshes every 4-6 weeks and you can't tolerate re-tuning.

Hidden costs nobody quotes

Costs that don't appear on the pricing page but break business cases:

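  • Egress. $0.05-0.12/GB on most clouds. Generated text itself is small: at roughly 4-6 bytes per token, 1B tokens/month is ~4-6 GB, well under $1/mo before protocol overhead. Egress becomes material only for bulk transfers, so price it per GB moved rather than per token. API providers absorb this; GPU rentals usually don't.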
  • Persistent storage for weights. 70B-Q4 is ~35GB; 70B-fp16 is ~140GB. HuggingFace cache across reboots needs persistent storage at $0.08-0.10/GB-month.
  • Monitoring + on-call. Prometheus, Grafana, alerting, rotation. Budget 3-5 engineering hours per week steady-state.
  • Cold-start retries. Spot evictions, host failures. Plan for ~99.5% provider uptime, not 100%. The retry layer is part of the cost.
  • First deployment engineering. First vLLM + 70B + Q4 setup that serves traffic: typically 1-2 engineer-weeks.
  • Version drift. New OSS model weights ship every 4-12 weeks. Re-evaluation + re-quantization + re-deploy: 1-3 engineer-days per cycle.

Honest summary: a self-managed deployment with paying users costs ~$1,500-3,000/mo of engineering attention on top of the GPU bill. Below ~3-5M tokens/day on a 70B-class model, API is cheaper even before subtracting that overhead.
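The summary above can be sanity-checked by putting the API bill next to the GPU bill plus engineering attention. This is a sketch with example inputs drawn from elsewhere in the guide (Together AI at $0.88/1M, Verda H100 at $2.29/hr, the low end of the $1,500-3,000/mo engineering range); the function names are hypothetical.

```python
def api_monthly_cost(tokens_per_day: float, api_per_million: float,
                     days: int = 30) -> float:
    """Monthly API spend for a steady daily token volume."""
    return tokens_per_day * days / 1_000_000 * api_per_million


def self_managed_monthly_cost(gpu_per_hour: float, engineering_monthly: float,
                              hours: int = 24 * 30) -> float:
    """GPU rent at 100% uptime plus the engineering attention it needs."""
    return gpu_per_hour * hours + engineering_monthly


api = api_monthly_cost(5_000_000, 0.88)       # 5M tokens/day
own = self_managed_monthly_cost(2.29, 1_500)  # H100 plus low-end engineering

print(f"API ${api:,.0f}/mo vs self-managed ${own:,.0f}/mo")
```

At this volume the engineering line alone is an order of magnitude larger than the API bill, which is the point of the paragraph above.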

Quantization as a cost lever

The single most powerful knob for cutting deployment cost is quantization — reducing bit-width to shrink VRAM and increase throughput. For well-tuned 4-bit on modern 70B models the quality cost disappears on most real workloads.

| Format | Bits/weight | VRAM 70B | VRAM 13B | VRAM 7B | Quality vs fp16 |
|---|---|---|---|---|---|
| fp16 / bf16 | 16 | ~140 GB (2× H100/A100) | ~26 GB | ~14 GB | Baseline |
| fp8 | 8 | ~70 GB (1× H100) | ~13 GB | ~7 GB | -0.5% to -1.5% |
| INT8 / W8A16 | 8 (weight only) | ~70 GB | ~13 GB | ~7 GB | -0.3% to -1% |
| AWQ Q4 / GPTQ Q4 | 4 | ~35-40 GB (1× A100 80GB) | ~7 GB | ~4 GB | -1% to -3%; usually imperceptible |
| Q3 / Q2 (llama.cpp) | 2-3 | ~22-30 GB | ~5 GB | ~3 GB | Visible drop; offline only |

Cost implications:

  • Q4 turns a 70B from a 2× H100 problem into a 1× A100 problem: a 50-65% hardware-cost cut for a 1-3% quality loss.
  • fp8 keeps near-fp16 quality and unlocks 1× H100 for 70B when quality must be preserved.
  • 7B-class models fit on a single L4 at fp16, so quantization is a throughput-and-latency lever, not a feasibility one.
  • Quantization isn't free in latency. AWQ/GPTQ kernels add 5-15% throughput penalty vs fp16 in real workloads; vLLM 0.6+ has closed most of the gap.

Lock the quantization format first; every downstream cost number flows from it.
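The VRAM figures in the table follow from parameter count times bits per weight. The sketch below covers weights only; KV-cache, activations, and runtime overhead come on top, which is an assumption about why the table's Q4 figures run slightly higher than the raw result.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate GB for model weights only: params × bits ÷ 8."""
    return params_billion * bits_per_weight / 8


formats = {"fp16": 16, "fp8": 8, "Q4": 4}
for params in (70, 13, 7):
    sizes = ", ".join(
        f"{name} ~{weight_vram_gb(params, bits):g} GB"
        for name, bits in formats.items()
    )
    print(f"{params}B: {sizes}")
```

Comparing the result against the GPU's memory, with headroom for KV-cache, is what decides whether a format fits on one card.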

When API still wins

API is the right answer when:

  • You're below 1.5M Llama-3.3-70B-class tokens/day (or equivalent for your model). Inference-layer overhead eats the apparent savings.
  • No on-call rotation, or team <3 people. Inference fails at 3am; APIs absorb that, you can't.
  • <500ms p99 SLO from a global audience. APIs have edge presence in 5+ regions; rolling your own multi-region GPU fleet is a separate, larger project.
  • Data isn't regulated and scale isn't multi-year. The compliance + capex-amortization arguments only kick in when one of those is true.
  • You're prototyping. Wait for production-volume signal before re-evaluating.

If two or more apply, the answer is API today — revisit in 6-12 months. Within API, the cheapest provider varies materially by model; optimize that on nfer's host pages.
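The "two or more apply" rule is simple enough to encode as a checklist. A minimal sketch; the condition labels are shorthand for the bullets above and the function name is hypothetical.

```python
from typing import Mapping


def api_still_wins(conditions: Mapping[str, bool], threshold: int = 2) -> bool:
    """True when at least `threshold` of the listed conditions hold."""
    return sum(bool(v) for v in conditions.values()) >= threshold


team = {
    "below_1_5m_tokens_per_day": False,
    "no_on_call_or_team_under_3": True,
    "global_sub_500ms_p99": False,
    "unregulated_and_not_multi_year": True,
    "prototyping": False,
}
print(api_still_wins(team))
```

Keeping the answers in a dated record makes the 6-12 month revisit a diff rather than a fresh debate.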

Conclusion: what to take away

  1. There is no single cheapest deployment shape. The cheapest answer is a joint function of volume, model size, sustained utilization, and operational tolerance — and at 2026 prices it moves through API → dedicated → GPU rent → self-host as those variables grow. Re-evaluate when any of them changes.
  2. Provider choice within a shape matters as much as shape choice. Two API providers for the same Llama-3.3-70B can differ 2-3× on output-token price (Nebius $0.40 vs Together AI $0.88 vs OVHcloud $0.74 as of 2026-05-12). Pick shape first using the decision tree, then provider on nfer's host pages.
  3. Model size changes rent-vs-buy, not just price. 70B-class workloads pay back self-hosted hardware over 14-20 months against rental. 7B-class workloads can pay back used consumer hardware against rental in 3-6 months. Don't apply 70B intuitions to 7B problems.
  4. The line item that breaks most self-managed cases is engineering attention, not compute. Budget 3-5 hours per week of on-call + ops on top of the GPU bill, or the case collapses against API. For small teams, API is almost always the right answer until volume forces the move.
  5. Quantization is the lever that changes every other number. Q4 turns a 2× H100 problem into a 1× A100 problem at a 1-3% quality cost; fp8 keeps near-fp16 quality on 70B with a single H100. Lock the format first.

If you remember nothing else: decide model first, then shape, then provider — in that order. Each decision constrains the next; getting them in the wrong order is how teams overpay.

FAQ

What's the cheapest provider for Llama-3.3-70B-Instruct API today?

As of 2026-05-12, the cheapest provider in the nfer index for Llama-3.3-70B-Instruct is Nebius at $0.13 per 1M input tokens and $0.40 per 1M output tokens. Together AI sits at $0.88 in/out, and Groq at $0.59/$0.79. The full live ranking is on nfer's host page for Llama-3.3-70B-Instruct, updated whenever the pricing pipeline refreshes.

How many tokens per day do I need before renting a GPU is cheaper than API?

For Llama-3.3-70B-Instruct at Q4 on a rented H100 at realistic 40% sustained utilization, the crossover is around 25M blended tokens per day against Together AI's rate and around 65M against Nebius's cheaper rate. For a 7B-class model on a rented L4 it is closer to 3-6B tokens per month sustained: the GPU is cheaper, but the API is cheaper too. The nfer comparator runs this math against your actual model and assumptions.

When does buying the GPU actually pay back?

For a 7B-class workload at sustained 24/7 throughput, a used consumer GPU at $600-1,800 pays back against API in 6-12 months and against rental in 3-6 months. For a 70B-class workload, a new H100 at $25-35k pays back against rental in 18-24 months; payback against API requires roughly 2-3B output tokens per month sustained.

Can I deploy Mistral-7B-Instruct legally and free of license fees?

Yes. Mistral-7B-Instruct v0.1, v0.2, and v0.3 are released under Apache 2.0 by Mistral AI; the weights are on HuggingFace and you can run them in production with no per-token fees. Do not confuse these with Mistral's commercial models (Large, Medium, Small 3) which are API-only and proprietary.

Does quantization hurt quality enough to matter?

For Q4 (AWQ or GPTQ) on a well-instruction-tuned 70B model the standard-benchmark quality cost is typically 1-3% and is rarely user-perceptible. Q2-Q3 has a visible drop and is only worth it for offline workloads. fp8 is near-lossless and is the right choice when you need a 70B on a single H100 with no quality compromise.

Related reading on nfer: Llama-3.3-70B-Instruct · Llama-3.1-70B-Instruct · Mistral-7B-Instruct-v0.3 · Qwen2.5-VL-72B-Instruct · Methodology.