LLM provider comparison 2026

Why this guide exists

There has never been more choice and less clarity about where to run an LLM. Twenty providers in nfer's index offer at least one of the three deployment shapes that make up the open-source LLM market today: per-token API, reserved dedicated endpoints, and on-demand GPU rental. The same model — Llama-3.3-70B-Instruct, say — is offered by eight API providers in the index, with output-token prices ranging from $0.40 per 1M (Nebius) to $1.20 per 1M (SambaNova), a 3× spread on identical capability. The same model is also rentable on nine different GPU configurations spanning $1.13 to $13.96 per hour, before you even consider the reserved-capacity discounts that some providers expose.

The reason the spread is so wide isn't quality — it's operating model. Some providers (Groq, Cerebras, SambaNova) optimize for token throughput on custom silicon and charge a premium for latency. Others (Nebius, DeepInfra) optimize for cost-per-token on commodity hardware and route aggressively. Hyperscalers (AWS, Azure, Google Cloud) charge a sovereignty + reliability premium. Five providers in the index are EU-headquartered and offer EU data residency. A few are vertically integrated — they own both the API and the GPU it runs on — which gives them pricing latitude others don't have.

This guide is built for the decision you're actually making: given a model, find the cheapest provider — across all three deployment shapes — with the sovereignty, region, and reliability constraints that apply to your workload. The buyer's matrix in section two is the anchor; per-provider profiles follow; the decision tree at the end maps "what to optimise for" to a short list of providers worth quoting.

TL;DR

Twenty providers in the index, three deployment shapes, one decision: cheapest provider for your model at your volume with your sovereignty constraints. Today (2026-05-12) the cheapest API for Llama-3.3-70B-Instruct is Nebius ($0.13 / $0.40 per 1M); the cheapest H100 SXM on-demand is Verda at $2.29/hr; the cheapest A100 80GB on-demand is also Verda at $1.29/hr. Within EU-owned providers the cheapest API for Llama-3.3-70B is also Nebius (Netherlands) and within French-owned providers it's OVHcloud at $0.74 in/out. Numbers shift with every pipeline refresh — the live ranking is on each model's host page on nfer.

The provider × pricing-mode matrix

The single most useful artefact for choosing a provider is the matrix below. Each row is one provider in the nfer index; each column is one deployment shape; a tick means that provider offers at least one model in that mode. The value of this view is not a single price — it's seeing at a glance which providers compete on what.

Six providers in the index compete across both API and GPU rental for the same models (Together AI, Nebius, OVHcloud, Scaleway, Google Cloud, AWS). Those six are the most interesting from a buyer's perspective — they let you switch shape without switching vendor, which is rare and operationally valuable when a workload's volume pattern is still evolving.

Provider	API (per-token)	GPU rental (on-demand)	Dedicated / reserved	HQ	EU-owned
AWS (Bedrock + EC2 GPU)	✓	✓	Reserved instances	United States	—
Azure (AI Foundry + ND GPU)	via Foundry	✓	Reserved capacity	United States	—
Baseten	✓	—	Dedicated endpoints	United States	—
Cerebras	✓	—	—	United States	—
CoreWeave	—	✓	Reserved by the hour/month	United States	—
DeepInfra	✓	—	Dedicated endpoints	United States	—
Fireworks AI	✓	—	Dedicated endpoints	United States	—
Google Cloud (Vertex + A3 GPU)	✓	✓	Reserved capacity	United States	—
Groq	✓	—	—	United States	—
Lambda Labs	—	✓	1-yr / 3-yr commitments	United States	—
MiniMax	✓	—	—	China	—
Mistral AI	✓	—	—	France	✓
Nebius	✓	✓	Reserved capacity	Netherlands	✓
OVHcloud	✓	✓	Reserved 1mo / 6mo / 12mo	France	✓
Replicate	✓	—	Cog deployments	United States	—
SambaNova	✓	—	—	United States	—
Scaleway	✓	✓	—	France	✓
Together AI	✓	✓	Dedicated endpoints	United States	—
Verda	—	✓	Reserved capacity	Finland	✓
Zhipu AI (Z.AI)	✓	—	—	China	—

Two patterns are worth noticing. The EU-owned cluster (Mistral, OVHcloud, Scaleway, Nebius, Verda) is unusually strong for an open-source workload: five providers covering both API and GPU rental between them, with three (OVHcloud, Scaleway, Nebius) competing in both shapes. The custom-silicon API specialists (Groq, Cerebras, SambaNova) sit at the latency-optimised end: they're 5-10× faster on tokens-per-second for the same model but charge a premium of 30-100% per token in return.
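The matrix is small enough to encode directly, which makes the "who offers both shapes" question a one-line filter. The following is a minimal sketch: the `Provider` class and `dual_shape` function are hypothetical names, and Azure's "via Foundry" cell is treated as not a tick, which is an assumption chosen to match the guide's count of six.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    api: bool         # tick in the "API (per-token)" column
    gpu_rental: bool  # tick in the "GPU rental (on-demand)" column
    eu_owned: bool


# Transcribed from the matrix above. Azure's API cell reads "via Foundry"
# rather than a tick, so it is recorded as api=False (an assumption).
PROVIDERS = [
    Provider("AWS", True, True, False),
    Provider("Azure", False, True, False),
    Provider("Baseten", True, False, False),
    Provider("Cerebras", True, False, False),
    Provider("CoreWeave", False, True, False),
    Provider("DeepInfra", True, False, False),
    Provider("Fireworks AI", True, False, False),
    Provider("Google Cloud", True, True, False),
    Provider("Groq", True, False, False),
    Provider("Lambda Labs", False, True, False),
    Provider("MiniMax", True, False, False),
    Provider("Mistral AI", True, False, True),
    Provider("Nebius", True, True, True),
    Provider("OVHcloud", True, True, True),
    Provider("Replicate", True, False, False),
    Provider("SambaNova", True, False, False),
    Provider("Scaleway", True, True, True),
    Provider("Together AI", True, True, False),
    Provider("Verda", False, True, True),
    Provider("Zhipu AI", True, False, False),
]


def dual_shape(providers, eu_only=False):
    """Names of providers offering both API and on-demand GPU rental."""
    return [
        p.name
        for p in providers
        if p.api and p.gpu_rental and (p.eu_owned or not eu_only)
    ]


print(dual_shape(PROVIDERS))                # all dual-shape providers
print(dual_shape(PROVIDERS, eu_only=True))  # EU-owned subset
```

Keeping the matrix as data rather than prose means the filter can be rerun whenever a provider adds or drops a deployment shape.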

Cheapest provider per popular model

Live snapshot from the nfer index as of 2026-05-12. Click through to any model's host page for the full provider-by-provider ranking, including reserved-capacity tiers and current GPU-rental rates.

Model	Cheapest API ($ in / out per 1M)	# API providers	Cheapest GPU rental ($/hr)
Llama-3.3-70B-Instruct	Nebius · $0.13 / $0.40	8	Azure RTX PRO 6000 · $1.13
Llama-3.1-70B-Instruct	DeepInfra · $0.40 / $0.40	2	Azure RTX PRO 6000 · $1.13
Mistral-7B-Instruct-v0.3	OVHcloud · $0.11 / $0.11	2	Verda V100 · $0.28
Mixtral-8×7B-Instruct-v0.1	DeepInfra · $0.54 / $0.54	3	Verda V100 · $1.10
Qwen3-Coder-30B-A3B-Instruct	OVHcloud · $0.07 / $0.26	3	Azure RTX PRO 6000 · $0.55
Qwen2.5-VL-72B-Instruct	Nebius · $0.25 / $0.75	2	Azure RTX PRO 6000 · $1.13
Gemma-3-27B-IT	DeepInfra · $0.08 / $0.16	4	Azure RTX PRO 6000 · $0.55
DeepSeek-V3.2	DeepInfra · $0.26 / $0.38	5	Nebius H200 SXM · $3.50
DeepSeek-R1	Google Cloud Vertex · $1.35 / $5.40	3	Nebius H200 SXM · $3.50

Three things stand out. The cheapest providers concentrate at Nebius, DeepInfra, and OVHcloud: together they hold the cheapest-API title on eight of the nine models above, with Google Cloud Vertex taking the ninth (DeepSeek-R1). Azure RTX PRO 6000 at $1.13/hr is the cheapest GPU option across most 70B-class models, largely because the RTX PRO 6000 is a much cheaper card to rent than an H100 SXM and fits a 70B at Q4 quantization. And for any given model, the API ↔ GPU-rent comparison hinges on volume; see the deployment guide for the break-even math.
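The concentration claim can be rechecked by tallying the cheapest-API column of the snapshot table. This is a small sketch; the dictionary is transcribed from the table above, and `title_counts` and `combined_titles` are hypothetical helper names.

```python
from collections import Counter

# Cheapest API provider per model, transcribed from the snapshot table.
CHEAPEST_API = {
    "Llama-3.3-70B-Instruct": "Nebius",
    "Llama-3.1-70B-Instruct": "DeepInfra",
    "Mistral-7B-Instruct-v0.3": "OVHcloud",
    "Mixtral-8x7B-Instruct-v0.1": "DeepInfra",
    "Qwen3-Coder-30B-A3B-Instruct": "OVHcloud",
    "Qwen2.5-VL-72B-Instruct": "Nebius",
    "Gemma-3-27B-IT": "DeepInfra",
    "DeepSeek-V3.2": "DeepInfra",
    "DeepSeek-R1": "Google Cloud Vertex",
}


def title_counts(cheapest):
    """How many cheapest-API titles each provider holds."""
    return Counter(cheapest.values())


def combined_titles(cheapest, providers):
    """Total titles held by the given group of providers."""
    counts = title_counts(cheapest)
    return sum(counts[p] for p in providers)


for provider, n in title_counts(CHEAPEST_API).most_common():
    print(f"{provider}: {n}")

top_three = ["Nebius", "DeepInfra", "OVHcloud"]
print(combined_titles(CHEAPEST_API, top_three), "of", len(CHEAPEST_API))
```

Because the snapshot shifts with every refresh, the tally is only as current as the table it was copied from.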

API-first providers

Token-priced, no infrastructure to manage, model-portable across providers that offer the same open-source model. Ranked by appearance frequency in the cheapest-API column across the index.

Nebius

Netherlands-headquartered hyperscaler challenger with full Llama-3 family coverage and unusually aggressive token pricing — currently cheapest in the nfer index for Llama-3.3-70B-Instruct ($0.13 / $0.40 per 1M) and Qwen2.5-VL-72B-Instruct ($0.25 / $0.75). Also offers H100 and H200 SXM on-demand GPU rentals at the cheaper end of the index. EU-owned, EU-data-residency capable. Best fit: cost-led API workloads where EU residency matters.

DeepInfra

US-headquartered API-only provider, broad open-source model catalogue, consistent cheap-by-default positioning. Cheapest API in the index today for Mixtral-8×7B ($0.54), Gemma-3-27B-IT ($0.08 / $0.16), DeepSeek-V3.2 ($0.26 / $0.38), and Llama-3.1-70B-Instruct ($0.40 / $0.40). Best fit: workloads where you want one provider for many open-source models without negotiating per-model.

OVHcloud

French hyperscaler with API + GPU rental in the same catalogue. Cheapest API for Mistral-7B-Instruct-v0.3 ($0.11 in/out) and Qwen3-Coder-30B-A3B ($0.07 / $0.26). Three reserved-tier GPU pricing levels (1-month / 6-month / 12-month commitments). EU-owned. Best fit: French / EU workloads needing data-residency guarantees and the option to switch between API and GPU rental without changing vendor.

Together AI

US-headquartered, both API and GPU rental in one catalogue. Wide open-source model coverage. Llama-3.3-70B-Instruct API at $0.88 in/out is not the cheapest in the index but is the most-cited baseline rate that other providers undercut. Dedicated endpoints are a clean transition path when your API spend crosses dedicated break-even.

Fireworks AI

US-headquartered, API-first with dedicated-endpoint upgrade path. Strong on serving infrastructure and speculative-decoding optimisations — competitive on throughput and tail-latency. Best fit: latency-sensitive production workloads on open-source models.

Groq

Custom-silicon API specialist (LPU). Wins on tokens per second, charges a premium per token vs commodity providers: Llama-3.3-70B at $0.59 input / $0.79 output is roughly 4.5× Nebius's input price and about 2× its output price for the same model. Best fit: latency-bound interactive workloads where token throughput is the constraint.

SambaNova

Custom-silicon API specialist (RDU). Like Groq, optimised for token throughput; charges a premium. Llama-3.3-70B at $0.60 input / $1.20 output — the most expensive of the eight API providers offering this model in the index. Best fit: enterprise workloads with hard latency SLOs and budget tolerance.
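The size of the custom-silicon premium depends on whether you compare input or output prices, so it helps to compute both ratios. This sketch uses only the Llama-3.3-70B rates quoted in this guide; `price_ratio` is a hypothetical helper, and using Nebius as the baseline is a choice made here for illustration.

```python
# Llama-3.3-70B-Instruct rates in $ per 1M tokens: (input, output),
# as quoted in this guide.
PRICES = {
    "Nebius": (0.13, 0.40),
    "Groq": (0.59, 0.79),
    "SambaNova": (0.60, 1.20),
}


def price_ratio(provider, baseline="Nebius", prices=PRICES):
    """Return (input_ratio, output_ratio) of provider vs baseline."""
    p_in, p_out = prices[provider]
    b_in, b_out = prices[baseline]
    return p_in / b_in, p_out / b_out


for name in ("Groq", "SambaNova"):
    r_in, r_out = price_ratio(name)
    print(f"{name}: {r_in:.1f}x input, {r_out:.1f}x output vs Nebius")
```

The input-side multiple is much larger than the output-side one for both providers, so prompt-heavy workloads feel the premium more than generation-heavy ones.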

Cerebras

Custom-silicon API specialist (CS-3 wafer-scale). Smaller model catalogue than Groq or SambaNova but the fastest tokens/sec on the models it does host. Best fit: same as Groq.

Mistral AI

French model lab and API provider for its own hosted models (the Apache 2.0 7B family + the commercial Large / Medium / Small tier). Strong on EU sovereignty. The commercial Mistral models are API-only and not hostable outside Mistral's infrastructure — don't confuse them with the open Mistral-7B family which you can deploy anywhere.

Replicate, MiniMax, Zhipu AI, Baseten

Replicate (US) is a model-hosting platform with strong breadth but less competitive per-token pricing for the most popular open-source models. MiniMax (China) and Zhipu AI / Z.AI (China) host their own model families plus selected open-source models; sovereignty considerations apply for non-China workloads. Baseten (US) is dedicated-endpoint-first with API on top.

GPU-rental specialists

Bring your own image (vLLM, TGI, sglang, llama.cpp); pay by the hour. Best fit when sustained utilization is above the API ↔ GPU-rent break-even (see the deployment guide).

Verda

Finland-based GPU-rental specialist with the cheapest A100 SXM in the nfer index — $1.29/hr on-demand — and the cheapest H100 SXM at $2.29/hr. Three pricing tiers (on-demand, mid-commitment, long commitment). EU-owned, EU-data-residency capable. Best fit: cost-led 70B workloads in Europe.

Lambda Labs

US-based GPU rental, broad SKU coverage (H100, A100, A6000, L4, etc.) with on-demand and 1-/3-year reserved pricing. Reliable supply, middle of the pack on price: A100 SXM at $2.79/hr, H100 SXM at $3.99/hr in the index. Best fit: production GPU workloads where supply and US data residency matter.

RunPod

US-based GPU rental with two tiers: Community Cloud (cheap, marketplace-style — variable host quality) and Secure Cloud (datacenter-grade, more expensive). Best fit: experimentation and bursty workloads where Community Cloud economics work, or production deployments on Secure Cloud when Lambda's catalogue doesn't match your SKU.

CoreWeave

US-based GPU specialist, enterprise positioning, deep hardware variety including A100 80GB PCIe at $2.21/hr and NVLink variants. Reserved-by-the-hour or by-the-month options. Best fit: enterprise GPU workloads at scale with dedicated networking + storage needs.

Hyperscaler clouds

AWS, Azure, and Google Cloud compete in this market through two surfaces: a managed-model API (Bedrock / AI Foundry / Vertex) and raw GPU instances (EC2 / NDv5 / A3). Both are in the nfer index where pricing is public.

AWS Bedrock hosts Llama-3.3-70B at $0.72 in/out — competitive but not market-leading; the differentiator is consolidated billing with the rest of your AWS footprint, IAM-integrated security, and cross-region availability. Google Cloud Vertex also hosts Llama-3.3-70B at $0.72 in/out, and is currently the cheapest API for DeepSeek-R1 in the index at $1.35 input / $5.40 output. Azure AI Foundry hosts Llama-3.3-70B via the Foundry API; Azure's GPU instances (ND H100 v5, NCv4 RTX PRO 6000) are competitive on the GPU side — the RTX PRO 6000 SKU at $1.13/hr is the cheapest 70B-Q4 footprint in the index right now.

For most cost-led open-source workloads, the hyperscalers are second-tier — Nebius, DeepInfra, and OVHcloud undercut them on API; Verda and Lambda undercut them on GPU rental. The hyperscalers win when your existing cloud footprint, contractual procurement, or compliance requirements outweigh the per-token premium.

EU-sovereign providers

Five EU-owned providers in the nfer index, covering both API and GPU rental between them. Useful for workloads requiring EU data residency, GDPR processor agreements without a Standard Contractual Clauses transfer, or supply-chain diversification away from US-only providers.

Provider	Country	API	GPU rental	Notable strength
Mistral AI	France	✓	—	Own model family + Mistral-7B Apache 2.0
OVHcloud	France	✓	✓	API + GPU + reserved tiers in one vendor
Scaleway	France	✓	✓	API + GPU rental, predictable EU billing
Nebius	Netherlands	✓	✓	Cheapest Llama-3.3-70B API in the index
Verda	Finland	—	✓	Cheapest H100 SXM and A100 SXM in the index

The combination Nebius API + Verda GPU gives you a fully EU-sovereign stack across both deployment shapes — useful when you want to keep the same model portable between modes without leaving the EU. For workloads that need French-controlled processing specifically (some healthcare and public-sector requirements), OVHcloud and Scaleway are the two French-headquartered options offering both API and GPU.

Decision tree: which provider to pick

Map your top constraint to a short list:

Optimise for cost, no other constraints? → Start with Nebius or DeepInfra on API; if volume justifies GPU rental, Verda (EU) or RunPod Community Cloud (US). Re-rank monthly — pricing shifts.
Optimise for latency / tokens-per-second? → Groq, Cerebras, or SambaNova. Expect a 30-100% per-token premium vs commodity providers in exchange for 3-10× faster generation. Watch for catalogue gaps — these providers host fewer models than the cost-led API providers.
Optimise for EU data residency? → OVHcloud (France, API + GPU), Scaleway (France, API + GPU), Nebius (NL, API + GPU), Mistral AI (France, API only), Verda (Finland, GPU only). Mix Nebius API + Verda GPU for a full-EU stack across both shapes.
Need consolidated billing with existing cloud contract? → AWS Bedrock, Google Cloud Vertex, or Azure AI Foundry. Expect a 20-50% per-token premium vs cheapest providers; you pay for procurement and IAM integration.
Need to switch between API and GPU rental as volume evolves? → Together AI, Nebius, OVHcloud, Scaleway, Google Cloud, or AWS are the six in the index offering both modes for overlapping model sets. Lets you switch shape without re-vendor-onboarding.
Prototyping, low volume? → Almost any API provider works. Start with Together AI or Fireworks AI for breadth, switch to Nebius / DeepInfra / OVHcloud when volume justifies the cost optimisation.
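The branches above amount to a lookup from constraint to shortlist. The sketch below encodes them that way; the constraint keys and the `shortlist` function are hypothetical, and combining several constraints by intersection is an assumption added here, not something the guide prescribes.

```python
# One entry per branch of the decision tree; keys are illustrative labels.
SHORTLISTS = {
    "cost": ["Nebius", "DeepInfra", "Verda", "RunPod"],
    "latency": ["Groq", "Cerebras", "SambaNova"],
    "eu_residency": ["OVHcloud", "Scaleway", "Nebius", "Mistral AI", "Verda"],
    "consolidated_billing": ["AWS", "Google Cloud", "Azure"],
    "switch_shape": ["Together AI", "Nebius", "OVHcloud", "Scaleway",
                     "Google Cloud", "AWS"],
    "prototyping": ["Together AI", "Fireworks AI"],
}


def shortlist(*constraints):
    """Providers that appear in every named branch.

    Order follows the first constraint's list. Raises KeyError for a
    constraint that is not one of the branches.
    """
    if not constraints:
        return []
    first, *rest = constraints
    others = [set(SHORTLISTS[c]) for c in rest]
    return [p for p in SHORTLISTS[first] if all(p in s for s in others)]


print(shortlist("cost"))
print(shortlist("cost", "eu_residency"))
print(shortlist("latency", "eu_residency"))
```

An empty result, as with latency plus EU residency, signals that no single provider on the shortlists satisfies both constraints and one of them has to be relaxed.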

What's not in this guide

Closed-frontier API providers (Anthropic's Claude, OpenAI's GPT, Google's Gemini direct, xAI's Grok, Z.AI's GLM) aren't compared here because they're API-only by definition: you can't pick where the model runs, so the deployment-shape decision this guide is organised around doesn't apply. If you've already decided you need a frontier closed model, the question is which frontier wins for your task, which is a benchmark question, not a deployment-cost question. Artificial Analysis and LLM-Stats do that comparison well.

Aggregator routers (OpenRouter, Portkey, LiteLLM) sit on top of the providers in this guide, routing requests to whichever back-end provider is cheapest, fastest, or available. They're complementary, not competitive: nfer's index is the source of truth for what the underlying provider charges, while the aggregator's value is in the routing layer.

Self-host scenarios — buying your own hardware, racking it in a colo, amortising over 36 months — are covered in the deployment guide with worked examples for both 7B and 70B workloads. The short version: self-host beats rental from ~12 months at stable 24/7 utilization on a 70B; from ~6 months on a 7B with consumer hardware.

FAQ

Which provider is cheapest for Llama-3.3-70B-Instruct as of 2026-05-12?

Nebius, at $0.13 per 1M input tokens and $0.40 per 1M output tokens, is the cheapest of the eight API providers in nfer's index that host Llama-3.3-70B-Instruct. Together AI is at $0.88 in and out; Groq at $0.59 in / $0.79 out. The live ranking is on nfer's Llama-3.3-70B-Instruct host page.
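Because input and output tokens are priced separately, the ranking for a real workload depends on its input/output mix. Here is a minimal sketch that ranks the three providers named above by monthly cost; the function names and the example volume (300M input, 100M output tokens per month) are hypothetical.

```python
# Llama-3.3-70B-Instruct rates in $ per 1M tokens: (input, output),
# as quoted in this answer.
LLAMA_33_70B = {
    "Nebius": (0.13, 0.40),
    "Together AI": (0.88, 0.88),
    "Groq": (0.59, 0.79),
}


def monthly_cost(price_in, price_out, input_m, output_m):
    """Cost in dollars for input_m / output_m million tokens."""
    return price_in * input_m + price_out * output_m


def rank_by_cost(prices, input_m, output_m):
    """Return (provider, cost) pairs sorted from cheapest to dearest."""
    costs = {
        name: monthly_cost(p_in, p_out, input_m, output_m)
        for name, (p_in, p_out) in prices.items()
    }
    return sorted(costs.items(), key=lambda item: item[1])


# Hypothetical workload: 300M input and 100M output tokens per month.
for name, cost in rank_by_cost(LLAMA_33_70B, input_m=300, output_m=100):
    print(f"{name}: ${cost:,.2f}")
```

Swapping in your own token volumes shows how far the gap between providers widens as the prompt share of traffic grows.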

How does GPU rental compare to API pricing for the same model?

It depends on volume and model size. For Llama-3.3-70B-Instruct at Q4 on a rented H100 SXM (Verda at $2.29/hr is cheapest in the nfer index), at 40% sustained utilization the break-even is around 25-30M tokens per day against Together AI's API. Below that volume, API wins. For a 7B-class model on a rented L4, the crossover is much higher in tokens because the API is also much cheaper.
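The break-even figure can be approximated with simple arithmetic. The sketch below rests on one assumption that is not stated in this guide: that "40% sustained utilization" means the GPU is billed for 40% of the day's hours. That reading lands near the lower end of the range quoted above; the deployment guide's own math may differ, and the sketch ignores whether a single GPU can actually serve that many tokens.

```python
HOURS_PER_DAY = 24


def break_even_m_tokens_per_day(gpu_hourly, api_price_per_m, billed_fraction):
    """Daily volume (millions of tokens) where GPU rent equals API spend.

    gpu_hourly:       GPU rental price in $/hr
    api_price_per_m:  blended API price in $ per 1M tokens
    billed_fraction:  share of the day the GPU is billed for (assumption:
                      this is how utilization enters the calculation)
    """
    daily_gpu_cost = gpu_hourly * HOURS_PER_DAY * billed_fraction
    return daily_gpu_cost / api_price_per_m


# Verda H100 SXM at $2.29/hr vs Together AI at $0.88 per 1M in and out.
at_40_percent = break_even_m_tokens_per_day(2.29, 0.88, 0.40)
always_on = break_even_m_tokens_per_day(2.29, 0.88, 1.00)

print(f"billed 40% of the day: {at_40_percent:.1f}M tokens/day")
print(f"billed 24/7:           {always_on:.1f}M tokens/day")
```

Below the computed volume the API is cheaper; above it the rented GPU is, provided its throughput can keep up. Against a cheaper API the break-even volume rises, since each token saved is worth less.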

Which providers are EU-owned and EU-data-residency capable?

Five providers in nfer's index are EU-owned: Mistral AI, OVHcloud, and Scaleway (France), Nebius (Netherlands), and Verda (Finland). OVHcloud, Scaleway, and Nebius offer both API and GPU-rental modes; Mistral is API-only for its hosted models; Verda is a GPU-rental specialist.

Why does this guide focus on open-source LLM providers instead of frontier models?

Closed-frontier models (Anthropic Claude, OpenAI GPT, Google Gemini, xAI Grok) are API-only by definition — you can't deploy them on your own GPU. The deployment-shape question (API vs dedicated vs GPU rent vs self-host) only applies to open-source models. For open-source models, the cheapest provider for the same model can differ by 2-7× across the index, which is the unique decision this guide helps you make.

How often are prices in the nfer index refreshed?

API prices and GPU-rental rates refresh from each provider's pricing page or public price API. Most refresh daily; some reserved-capacity tiers refresh when the provider publishes a change. The methodology page documents the exact data sources and cadence.