Methodology

Overview#

nfer's pricing math is open and reproducible. Every number on the comparator is derived from a public, provider-published source plus the assumptions you set on the home page (daily token volume, input/output split, cache hit rate, active hours and days). GPU and self-host rows additionally run through a physics-based cost engine whose every constant is published in the live assumptions registry below. This page documents every assumption, every formula, and every limitation.

Data sources & cadence#

Every price is synced from the provider's own pricing page or public price API. There are no negotiated private rates in the dataset. Most pricing refreshes daily; tiers that don't publish on a clear public schedule (some reserved-capacity offerings, for example) are reviewed when a provider updates them.

You can find the latest update time on every model card in the comparator.

Pricing math#

API providers (per-token): monthly cost is the blended per-token rate times your monthly volume (daily token volume times active days per month).

monthly_cost
  = tokens_per_day × active_days_per_month
  × ( input_share × ( (1 − cache_hit_rate) × input_price
                    + cache_hit_rate × cached_input_price )
    + output_share × output_price )

The cache hit rate applies to input tokens: cached input is billed at the provider's published cached-input rate where one exists, otherwise at the normal input rate. Default assumptions: input_share = 0.20, output_share = 0.80, cache_hit_rate = 0. Each is editable in the home-page assumptions bar.

GPU rent (hourly): monthly cost is the hourly rate times the hours you actually run, set by the active-hours and active-days assumptions.

monthly_cost
  = hourly_price
  × active_hours_per_day
  × active_days_per_month

Defaults are 24 hours a day, 30 days a month (always-on serving); both are adjustable in the assumptions bar. Where a provider imposes a minimum commitment, billed hours are floored at that commitment, capped at 720 hours per month so annual terms are not charged in full each month.

Dedicated / reserved (hourly committed): same shape as GPU rent but with a fixed throughput ceiling. Where a provider doesn't publish a clear hourly equivalent, we approximate using their published reserved discount against on-demand list price; rows derived this way are flagged.

The GPU cost engine#

GPU and self-host rows are priced by a physics-based cost engine, not a rule of thumb. For every model and GPU pairing the engine answers two questions: does the model fit, and how fast does it run.

Memory fit. The VRAM a deployment needs is the sum of the model weights at their quantization, the KV cache at an assumed sequence length, and a fixed-plus-activation overhead. A configuration fits when that total stays within 90% of the hardware's VRAM (a 0.90 fit margin, matching vLLM's default memory utilization).

VRAM_required = weights + KV_cache + overhead
fits if VRAM_required ≤ GPU_VRAM × 0.90

Throughput. Token generation (decode) is bound by memory bandwidth: producing each new token requires streaming the model weights through the GPU. Prompt processing (prefill) is bound by compute. The engine models both phases from datasheet-verified GPU specs (memory bandwidth and dense tensor TFLOPS), scaled by per-GPU efficiency factors calibrated against published NVIDIA benchmarks. Estimates land within documented accuracy bands: about ±15% for a single GPU at full precision, widening to about ±35% for multi-GPU setups over PCIe.

Effective $/1M tokens. Each engine-covered GPU offering shows a per-token price derived from its own modeled performance: the hourly price divided by the blended throughput (decode and prefill weighted by the input/output split) at a serving-utilization assumption, 0.6 by default. Real fleets do not sustain lab-peak throughput; traffic burstiness and scheduler gaps put sustained rates at 50 to 70% of peak. The resulting figure is a property of the offering itself and does not move with your volume slider.

effective_$_per_1M
  = hourly_price
  / (blended_tokens_per_sec × 3600 × serving_utilization)
  × 1,000,000

Feasibility. Your tokens-per-day assumption is checked against each configuration's blended throughput at serving utilization. Configurations that cannot carry the workload are flagged, together with the multiple of them you would need. The capacity figure shown on a row is decode-only output-token capacity, a display metric; the feasibility check uses the blended basis.

Mixture-of-experts models. Where the active parameter count can be derived from the model's published configuration, the engine uses it for the compute estimate. Where it cannot, the model is treated as dense. That choice is conservative: it overstates the self-host cost rather than understating it.

Confidence badges. Every modeled row carries a high, medium, or low badge following the accuracy-band ladder: high for single-GPU, full-precision estimates; medium for quantized, multi-GPU, or mixture-of-experts cases; low for models without architecture data, where the engine falls back to parameter-count heuristics.

Coverage. Models not yet covered by the engine's precomputed table use the legacy VRAM-fit rule (weights times a 1.2 overhead factor) until the next recompute run picks them up.

The normative reference for every formula is docs/reference/gpu-model-matching-formulas.md in the nfer repository. The constants the engine runs with are published in the live assumptions section below.

API vs GPU rent break-even#

The break-even point is the daily token volume at which continuously running a rented GPU starts costing less than paying per-token API rates.

tokens_per_day_break_even
  = (gpu_hourly_price × 24)
  / blended_token_price

Worked example

Llama-3-70B Q4 deployment on a single A100 at $1.20/hour, with a blended token price of $0.40/1M tokens:

tokens_per_day_break_even
  = ($1.20 × 24)
  / $0.40 / 1,000,000
  ≈ 72M tokens / day

Below ~72M tokens/day, API is cheaper; above it, the rented GPU wins (assuming you can keep it busy). The comparator now does this arithmetic per row: every engine-covered GPU offering carries an effective $/1M derived from its modeled throughput at serving utilization, directly comparable with API per-token rates, and offerings too small for your tokens-per-day assumption are flagged. Rows served through the legacy fit fallback show their hourly and monthly prices without a modeled per-token figure.

Currency conversion#

Source prices are stored in their native currency (USD, EUR, GBP). Display currency is selected via the navbar picker. Conversion rates are pulled from authoritative public sources and refreshed regularly so prices stay close to live.

Live assumptions#

Every constant the cost engine uses lives in a public assumptions registry, and this table renders straight from that registry: what you read here is what production computes with. Each entry carries its value, its unit, the reasoning behind it, its sources, and the date it was last reviewed. Rows marked "adjustable in the assumptions bar" are defaults you can override on the home page; every other value is fixed in the engine and changes only through a reviewed update that bumps the engine version.

Loading the assumptions registry.

Limitations and known gaps#

Throughput is modeled, not measured. GPU performance figures are engine estimates within the documented accuracy bands, not benchmarks run on each provider's specific instances.
Reserved/committed pricing is approximated for providers that don't publish per-hour equivalents.
Owned-hardware and co-location deployments (buying GPUs and amortizing them) aren't yet covered by the comparator - self-hosting here means renting GPUs by the hour.
Fine-grained regional pricing (e.g. AWS region-by-region GPU rates) is collapsed to a representative region per provider; full per-region detail is on the roadmap.
Provider coverage is expanding continuously. If a provider you need is missing, please email and we'll prioritize it.

Contributing & corrections#

If a price looks wrong, please email the provider URL and the figure you expected to [email protected]. Most corrections land within a day.

Methodology

Overview#

Data sources & cadence#

Pricing math#

The GPU cost engine#

API vs GPU rent break-even#

Currency conversion#

Live assumptions#

Limitations and known gaps#

Contributing & corrections#

Popular models

Providers

Learn

About