Profile

Physics-aware optimization loop for vLLM inference servers.

What it does

  • Computes the hardware ceiling for your model + GPU. Shows how far below it you are.
  • Detects the bottleneck: batching gap, KV pressure, concurrency cap, prefix cache miss.
  • Measures before and after. Every recommendation is accountable.

Install

# Linux x86_64
wget https://github.com/jungledesh/profile/releases/latest/download/profile
chmod +x profile
./profile diagnose --url http://localhost:8000/metrics

# Rust toolchain
cargo install profile-vllm

Usage

profile diagnose --url http://HOST:8000/metrics
profile diagnose --url http://HOST:8000/metrics --duration 2m
profile diagnose --url http://HOST:8000/metrics -m 256

What profile detects

| Rule | Symptom | Fix |
| --- | --- | --- |
| R1 Under-batching | GPU fed <25% of capacity | Raise concurrency or fix client |
| R2 KV cache pressure | KV usage ≥88% or preemptions active | Reduce max_num_seqs or add memory |
| R3 Low prefix reuse | Hit rate <35% | Enable prefix caching |
| R4 OOM risk | Weights exceed VRAM | Add tensor parallelism |
| R5 Concurrency saturation | max_num_seqs hit, queue building | Raise --max-num-seqs |

Catalog

GPU specs, model parameters, and cloud prices resolved once at startup.

GPUs

FP32 TFLOPs and memory bandwidth feed the roofline math. Prices are market-average estimates as of 2026-06. Use --cost-per-hour for exact rates.

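| GPU | Arch | FP32 TFLOPs | Mem BW GB/s | On-demand $/hr | Spot $/hr |
| --- | --- | --- | --- | --- | --- |
| H100 PCIe | Hopper | 60.0 | 2,000 | $2.20 | $0.90 |
| H100 SXM | Hopper | 67.0 | 3,350 | $3.45 | $1.85 |
| H200 | Hopper | 67.0 | 4,800 | $3.50 | $2.00 |
| A100 80GB | Ampere | 19.5 | 2,039 | $1.50 | $0.60 |
| A100 40GB | Ampere | 19.5 | 1,555 | $1.50 | $0.60 |
| A10G | Ampere | | | $1.00 | $0.40 |
| L40S | Ada | 91.6 | 864 | $1.80 | $0.80 |
| RTX 4090 | Ada | 82.6 | 1,008 | | |
| RTX Pro 6000 Blackwell | Blackwell | 125.0 | 960 | | |
| B200 | Blackwell | 20.0 | 8,000 | $6.00 | $3.50 |
| GB200 | Blackwell | | | $8.50 | $4.00 |
| GB10 (DGX Spark) | Blackwell | 20.0 | 273 | | |
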
FP32 TFLOPs are non-tensor-core throughput; the conservative roofline input. All values labeled (est) in output. A10G and GB200 have price entries only; roofline is skipped for those GPUs. A plain "A100" without a size token (80GB or 40GB) in the NVML name will not match; Profile requires the size to disambiguate. Prices sampled across AWS, GCP, Azure, Lambda, Nebius, RunPod, CoreWeave, and GMI Cloud.

Models

Matched by token against the model name vLLM reports at startup.

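| Model | Total params | Active params | MoE |
| --- | --- | --- | --- |
| Llama 4 Maverick | 400B | 17B | Yes |
| Llama 4 Scout | 109B | 17B | Yes |
| Llama 3 405B | 405B | | No |
| Llama 3 70B | 70B | | No |
| Llama 3 8B | 8B | | No |
| Nemotron 70B | 70B | | No |
| Nemotron 8B | 8B | | No |
| Qwen3 235B | 235B | 22B | Yes |
| Qwen3 72B | 72B | | No |
| Qwen3 32B | 32B | | No |
| Qwen3 30B | 30B | 3B | Yes |
| Qwen3 14B | 14B | | No |
| Qwen3 7B | 7B | | No |
| Qwen2.5 72B | 72B | | No |
| Qwen2.5 32B | 32B | | No |
| Qwen2.5 14B | 14B | | No |
| Qwen2.5 7B | 7B | | No |
| DeepSeek 671B / R1 / V3 | 671B | 37B | Yes |
| DeepSeek 70B | 70B | | No |
| DeepSeek 7B | 7B | | No |
| Mistral Large 675B | 675B | 52B | Yes |
| Mistral Large 123B | 123B | | No |
| Mixtral 8x22B | 141B | 39B | Yes |
| Mixtral 8x7B | 47B | 13B | Yes |
| Mistral 7B | 7B | | No |
| Gemma 27B | 27B | | No |
| Gemma 9B | 9B | | No |
| Kimi K2 | 1,000B | 32B | Yes |
| GLM 744B | 744B | 56B | Yes |
| GLM 32B | 32B | | No |
| Phi 4 32B | 32B | | No |
| Phi 4 14B | 14B | | No |
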
MoE models: active_param_count drives roofline ceilings; param_count (total) drives weight_gb and OOM headroom. Unrecognized models skip roofline entirely. To add a model or GPU, open an issue or submit a PR to the catalog files in src/context/.

Math

The formulas Profile runs on collected data.

Profile measures behavior under load. Idle time has no waste to cut. Ceilings are catalog-derived upper bounds, labeled (est). Missing data stays absent (None), never guessed.

Roofline: hardware ceilings

Baseline inputs

| Input | Resolution |
| --- | --- |
| roofline_params | active_param_count if present, else param_count. Used for decode and prefill ceilings, efficiency %, and ridge batch size. |
| weight_params | param_count if present, else active_param_count. Used for weight_gb and kv_headroom_gb (OOM check). |
| bytes_per_param | DTYPE / VLLM_DTYPE env → kv_cache_dtype → catalog default → bf16 (2 bytes). fp8 = 1, fp16/bf16 = 2, fp32 = 4. Source labeled in output when fallback is used. |
| tensor_parallel_size | --tensor-parallel-size CLI flag → TENSOR_PARALLEL_SIZE / VLLM_TENSOR_PARALLEL_SIZE env var → 1. Scales both ceilings and per-GPU KV headroom. |
| seq_len (prefill only) | prompt_tokens_mean rounded → max_model_len → absent: prefill ceiling skipped. |
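
The bytes_per_param fallback chain can be sketched as a small lookup. This is an illustration in Python, not Profile's implementation: the function name, the source labels, and the choice to check DTYPE before VLLM_DTYPE are assumptions, and only the four dtype spellings listed in the table are recognized.

```python
from typing import Mapping, Optional, Tuple

# Byte widths exactly as listed in the table above.
BYTES_BY_DTYPE = {"fp8": 1, "fp16": 2, "bf16": 2, "fp32": 4}

def resolve_bytes_per_param(
    env: Mapping[str, str],
    kv_cache_dtype: Optional[str] = None,
    catalog_dtype: Optional[str] = None,
) -> Tuple[int, str]:
    """Return (bytes_per_param, source label), walking the fallback chain."""
    candidates = [
        ("env:DTYPE", env.get("DTYPE")),            # assumed to win over VLLM_DTYPE
        ("env:VLLM_DTYPE", env.get("VLLM_DTYPE")),
        ("kv_cache_dtype", kv_cache_dtype),
        ("catalog", catalog_dtype),
    ]
    for source, dtype in candidates:
        if dtype is not None and dtype.lower() in BYTES_BY_DTYPE:
            return BYTES_BY_DTYPE[dtype.lower()], source
    # Nothing usable: fall back to bf16 and say so.
    return 2, "bf16 fallback"
```

Returning the source alongside the value mirrors the requirement that the output labels which fallback was used.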

Two ceilings, two binding constraints. LLM serving is memory-bandwidth-bound at decode and compute-bound at prefill.

| Formula | Result | Binding constraint |
| --- | --- | --- |
| (peak_bw_gbps × tp) × 1e9 / (params × bytes_per_param) | Decode ceiling (tok/s) | Memory bandwidth, binding in steady-state serving. tp GPUs contribute tp× aggregate bandwidth. |
| (peak_flops × tp) × 1e12 / (6 × params × seq_len) | Prefill ceiling (tok/s) | Compute, binding during prompt processing. tp GPUs contribute tp× aggregate FLOPs. |
| expected × 0.85 / expected × 1.05 | Ceiling range (lower / upper) | −15% / +5% band around expected. Reflects catalog approximation, not measured hardware. |
| (peak_flops × 1e12 × bytes_per_param) / (peak_bw_gbps × 1e9) | Ridge batch size | Concurrent batch at which decode crosses from BW-bound to compute-bound. Below: BW limits throughput. At or above: compute limits throughput. Prefill floor display suppressed when num_running ≥ ridge_batch_size. |

peak_bw_gbps, peak_flops, and params come from catalogs, not live measurement. All ceiling outputs labeled (est).
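
As a worked example, the four roofline formulas can be transcribed directly into Python. This is a minimal sketch: the function names are illustrative, and the inputs are the catalog values listed above for H100 SXM (67.0 TFLOPs, 3,350 GB/s) and Llama 3 8B at bf16.

```python
def decode_ceiling_tps(peak_bw_gbps: float, params: float,
                       bytes_per_param: float, tp: int = 1) -> float:
    """Bandwidth-bound decode ceiling in tok/s."""
    return (peak_bw_gbps * tp) * 1e9 / (params * bytes_per_param)

def prefill_ceiling_tps(peak_flops_tflops: float, params: float,
                        seq_len: int, tp: int = 1) -> float:
    """Compute-bound prefill ceiling, as given in the formula table."""
    return (peak_flops_tflops * tp) * 1e12 / (6 * params * seq_len)

def ceiling_range(expected: float) -> tuple:
    """(lower, upper) band: -15% / +5% around the expected ceiling."""
    return expected * 0.85, expected * 1.05

def ridge_batch_size(peak_flops_tflops: float, peak_bw_gbps: float,
                     bytes_per_param: float) -> float:
    """Batch size at which decode crosses from BW-bound to compute-bound."""
    return (peak_flops_tflops * 1e12 * bytes_per_param) / (peak_bw_gbps * 1e9)

# H100 SXM + Llama 3 8B, bf16 (2 bytes per parameter), TP=1.
decode = decode_ceiling_tps(3350, 8e9, 2)      # about 209.4 tok/s per request
lower, upper = ceiling_range(decode)           # about 178.0 to 219.8 tok/s
ridge = ridge_batch_size(67.0, 3350, 2)        # about 40 concurrent requests
```

Doubling tp doubles both ceilings, which is why a missing TP setting underestimates them on multi-GPU deployments.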

Efficiency %

aggregate_ceiling = decode_ceiling_tps × num_requests_running
efficiency_pct    = actual_tps / aggregate_ceiling × 100

decode_ceiling_tps is TP-scaled (peak_bw × tp from the roofline above). Requires num_requests_running > 0 and generation_tokens_per_sec > 0. Not shown if actual_tps > aggregate_ceiling: that indicates a catalog or measurement mismatch.

Calibrated for decode-bound workloads. Prefill-heavy workloads (long prompts, short outputs) show artificially low efficiency %. The prefill ceiling is the relevant constraint there, not decode.
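
The efficiency calculation and its guards can be sketched as follows. The function name is illustrative; missing or non-positive inputs, and a measured rate above the aggregate ceiling, all yield None rather than a guessed number.

```python
from typing import Optional

def efficiency_pct(actual_tps: Optional[float],
                   decode_ceiling_tps: float,
                   num_requests_running: Optional[float]) -> Optional[float]:
    """Percent of the aggregate decode ceiling actually achieved."""
    if actual_tps is None or num_requests_running is None:
        return None
    if actual_tps <= 0 or num_requests_running <= 0:
        return None
    aggregate_ceiling = decode_ceiling_tps * num_requests_running
    if actual_tps > aggregate_ceiling:
        # Catalog or measurement mismatch: do not show a >100% number.
        return None
    return actual_tps / aggregate_ceiling * 100

# 8 running requests against a 209.375 tok/s per-request ceiling.
eff = efficiency_pct(837.5, 209.375, 8)   # half of the 1675 tok/s aggregate
```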

Weight footprint and headroom

weight_gb        = weight_params × bytes_per_param / 1e9
kv_headroom_gb   = vram_total_gb − (weight_gb / tp)
headroom_pct     = 100 − min(efficiency_pct, 100)
tpot_floor_ms    = 1000 / decode_ceiling_tps
prefill_floor_ms = 1000 / prefill_ceiling_tps

kv_headroom_gb: per-GPU; each GPU holds weight_gb/N with TP=N. Negative means current TP is insufficient. headroom_pct: unused fraction of the decode ceiling. tpot_floor_ms and prefill_floor_ms: theoretical minimum latencies at ceiling; actual values above these indicate overhead.
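
A short sketch of the footprint and floor formulas, using Llama 3 70B at bf16 on 80 GB GPUs as the example. Function names are illustrative.

```python
from typing import Optional

def weight_gb(weight_params: float, bytes_per_param: float) -> float:
    return weight_params * bytes_per_param / 1e9

def kv_headroom_gb(vram_total_gb: float, weights_gb: float, tp: int = 1) -> float:
    """Per-GPU headroom; negative means the current TP degree is insufficient."""
    return vram_total_gb - (weights_gb / tp)

def headroom_pct(efficiency_pct: float) -> float:
    """Unused fraction of the decode ceiling."""
    return 100 - min(efficiency_pct, 100)

def floor_ms(ceiling_tps: Optional[float]) -> Optional[float]:
    """Theoretical minimum latency at the ceiling (TPOT or prefill floor)."""
    if ceiling_tps is None or ceiling_tps <= 0:
        return None
    return 1000 / ceiling_tps

weights = weight_gb(70e9, 2)                # 140 GB of weights
tp1 = kv_headroom_gb(80.0, weights, tp=1)   # negative: does not fit on one GPU
tp2 = kv_headroom_gb(80.0, weights, tp=2)   # positive: 70 GB of weights per GPU
```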

Economics

| Metric | Formula |
| --- | --- |
| tok / W | generation_tokens_per_sec / power_watts |
| J / token | power_watts / generation_tokens_per_sec |
| $ / 1M tokens | (cost_per_hr / tokens_per_hr) × 1e6 |
| Recoverable $/hr | cost_per_hr × (1 − efficiency_pct / 100) |

tokens_per_hr = generation_tokens_per_sec × 3600. You pay cost_per_hr for that many tokens; scale to 1M.

Cost priority: --cost-per-hour flag, then GPU catalog on-demand price, then absent. Catalog-derived $/1M tok is labeled (est) in output; user-provided rates are not.
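
The economics formulas are simple enough to check by hand. The sketch below uses illustrative function names and returns None whenever an input is absent or throughput is zero.

```python
from typing import Optional

def cost_per_million_tokens(cost_per_hr: Optional[float],
                            generation_tokens_per_sec: Optional[float]) -> Optional[float]:
    if cost_per_hr is None or not generation_tokens_per_sec:
        return None
    tokens_per_hr = generation_tokens_per_sec * 3600
    return cost_per_hr / tokens_per_hr * 1e6

def recoverable_per_hr(cost_per_hr: Optional[float],
                       efficiency_pct: Optional[float]) -> Optional[float]:
    if cost_per_hr is None or efficiency_pct is None:
        return None
    return cost_per_hr * (1 - efficiency_pct / 100)

def joules_per_token(power_watts: Optional[float],
                     generation_tokens_per_sec: Optional[float]) -> Optional[float]:
    if power_watts is None or not generation_tokens_per_sec:
        return None
    return power_watts / generation_tokens_per_sec

# H100 SXM catalog on-demand price, 1,000 tok/s measured, 50% efficiency.
usd_per_m = cost_per_million_tokens(3.45, 1000.0)   # a little under $0.96
recoverable = recoverable_per_hr(3.45, 50.0)        # half of the hourly bill
```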

Closed-loop delta

throughput_delta_pct = (after − before) / before × 100
efficiency_delta_pp  = efficiency_pct_after − efficiency_pct_before

Computed after every re-measure cycle. Direction uses efficiency_delta_pp, not raw throughput. Dividing by num_running (time-weighted mean across the window) cancels traffic-induced concurrency changes; what remains is per-request hardware utilization.

| Signal | Better | Worse | Plateau |
| --- | --- | --- | --- |
| efficiency_delta_pp (primary) | > +2.0 pp, and ttft_p99 not regressed > 20% | < −5.0 pp | between, or latency veto fired |
| throughput_delta_pct (fallback when efficiency unavailable) | > +10% | < −10% | between or both absent |

Latency veto: if efficiency says Better but ttft_p99_delta_pct > +20%, direction is Plateau. Downgrades Better only; does not affect Plateau or Worse. Skipped on the throughput fallback path.

On Worse: loop pauses. Operator chooses [r] revert or [c] continue. On continue, the degraded state becomes the new baseline before the next iteration. On revert, baseline is unchanged.

All thresholds are provisional constants. Latency, cost, and p99 before/after are tracked alongside direction for display.
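
The direction logic, including the latency veto, can be sketched as one function. The name classify_direction and the string labels are illustrative; the thresholds are the provisional constants from the table above.

```python
from typing import Optional

def classify_direction(efficiency_delta_pp: Optional[float],
                       throughput_delta_pct: Optional[float],
                       ttft_p99_delta_pct: Optional[float]) -> str:
    """Return 'Better', 'Worse', or 'Plateau' for one re-measure cycle."""
    if efficiency_delta_pp is not None:            # primary signal
        if efficiency_delta_pp > 2.0:
            # Latency veto: downgrades Better only.
            if ttft_p99_delta_pct is not None and ttft_p99_delta_pct > 20.0:
                return "Plateau"
            return "Better"
        if efficiency_delta_pp < -5.0:
            return "Worse"
        return "Plateau"
    if throughput_delta_pct is not None:           # fallback, veto skipped
        if throughput_delta_pct > 10.0:
            return "Better"
        if throughput_delta_pct < -10.0:
            return "Worse"
        return "Plateau"
    return "Plateau"                               # both signals absent
```

Note the asymmetry in the primary thresholds: a gain counts as Better above +2.0 pp, while a loss must exceed 5.0 pp before the loop pauses on Worse.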

Per-window collection (~2s)

Each window polls at ~250ms intervals. GPU (NVML) and vLLM are collected in parallel threads on each poll. These formulas produce one snapshot before multi-window aggregation.

| Metric type | Formula | On failure |
| --- | --- | --- |
| Histogram mean (TTFT, TPOT, prefill, queue, prompt tokens) | Δsum / Δcount first→last scrape; ×1000 for seconds-based latencies | Last-scrape cumulative mean when Δcount ≤ 0 |
| Histogram p99 | Delta buckets, then linear interpolation at q=0.99 | None. No cumulative fallback (stale p99 is worse than none) |
| Counter rate (tok/s, req/s, preemptions/s) | (last − first) / window_duration_secs | None on negative delta or zero duration. Zero delta is valid idle, not missing |
| Prefix cache hit rate | Δhits / Δqueries | None when Δqueries ≤ 0 |
| GPU util, power | Mean across window polls | Poll skipped if NVML field absent |
| VRAM, temp (current) | Last poll | Peaks (vram_peak_mb, temperature_peak_c, kv_cache_peak_perc): max across polls/scrapes |

Collection cadence and edge cases are covered in Data.
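
The histogram-mean and counter-rate rows of the per-window table behave differently on failure, which is easy to see in code. The function names below are illustrative; inputs are the first and last scrape values within one window.

```python
from typing import Optional

def histogram_delta_mean(sum_first: float, count_first: float,
                         sum_last: float, count_last: float) -> Optional[float]:
    """Δsum / Δcount, falling back to the cumulative mean when Δcount <= 0."""
    d_count = count_last - count_first
    if d_count <= 0:
        # No new completions in the window: use the last-scrape cumulative mean.
        return sum_last / count_last if count_last > 0 else None
    return (sum_last - sum_first) / d_count

def counter_rate(first: float, last: float,
                 window_duration_secs: Optional[float]) -> Optional[float]:
    """Per-second rate; None on counter reset or zero-duration window."""
    if window_duration_secs is None or window_duration_secs <= 0:
        return None
    delta = last - first
    if delta < 0:
        return None          # counter reset
    return delta / window_duration_secs   # 0.0 is valid idle, not missing

ttft_ms = histogram_delta_mean(10.0, 100, 12.0, 110) * 1000  # seconds to ms
tok_s = counter_rate(1000, 1500, 2.0)
```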

Aggregation: combining windows

Each ~2s window produces one snapshot. Fields aggregate by type across windows. Active: server under real load (requests running, KV or GPU util above threshold). Evaluable: endpoint responded with valid data.

| Field type | Window set | Method |
| --- | --- | --- |
| GPU util, power, running/waiting means, rates (tok/s, req/s) | Active | Time-weighted mean: Σ(value × duration) / Σ(duration) |
| Histogram means (TTFT, TPOT, prefill, queue latency) | Active | ΣΔsum / ΣΔcount across active windows. Weighted by observation count, not time. |
| p99 (TTFT, TPOT) | Active | Merge per-window delta bucket vectors (sum counts at matching boundaries), recompute q=0.99 via linear interpolation. Never average scalar p99 values. |
| KV cache avg, prefix cache hit rate, prompt_tokens_mean | Evaluable | KV avg: time-weighted mean. Prefix hit rate: Σ Δhits / Σ Δqueries across all evaluable windows. |
| KV / VRAM / temp peaks | Evaluable | max(per-window peak, last evaluable landing value) so aggregate peak ≥ displayed current. |
| State gauges (VRAM used, temp, sm_clock) | Evaluable (last) | Last evaluable window's landing value. |
| Cumulative counters (total tokens, total reqs) | All (chronological last) | Chronologically last collected window. Idle tail included. Preserves true Prometheus server totals. |
| All windows non-evaluable | | Chronologically last raw window returned in full. |
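
Two of the aggregation methods benefit from a concrete sketch: the time-weighted mean and the merged-bucket p99. The code below assumes cumulative Prometheus-style buckets (each count includes everything at or below its upper bound) and interpolates linearly inside the bucket that contains the target rank; the exact interpolation Profile uses may differ in edge cases. All names are illustrative.

```python
import math
from typing import List, Optional, Sequence, Tuple

def time_weighted_mean(samples: Sequence[Tuple[float, float]]) -> Optional[float]:
    """samples: (value, duration_secs) per active window."""
    total = sum(d for _, d in samples)
    if total <= 0:
        return None
    return sum(v * d for v, d in samples) / total

def merge_buckets(windows: Sequence[Sequence[float]]) -> List[float]:
    """Sum per-window delta bucket counts at matching boundaries."""
    return [sum(col) for col in zip(*windows)]

def quantile_from_buckets(bounds: Sequence[float], cum_counts: Sequence[float],
                          q: float = 0.99) -> Optional[float]:
    """Linear interpolation within cumulative buckets; None when empty."""
    total = cum_counts[-1]
    if total <= 0:
        return None
    rank = q * total
    prev_bound, prev_cum = 0.0, 0.0
    for bound, cum in zip(bounds, cum_counts):
        if cum >= rank:
            if math.isinf(bound):
                return prev_bound   # rank falls in the open-ended bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_cum) / (cum - prev_cum)
        prev_bound, prev_cum = bound, cum
    return None

bounds = [0.1, 0.5, 1.0, math.inf]          # upper bounds in seconds
window_a = [50, 90, 100, 100]               # delta counts, window A
window_b = [0, 10, 100, 100]                # delta counts, window B
merged = merge_buckets([window_a, window_b])
p99 = quantile_from_buckets(bounds, merged)
```

Here window A alone gives a p99 near 0.95 s and window B alone gives a value just under 1.0 s. The merged p99 is weighted by where the observations actually fell, which is why scalar p99 values are never averaged.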

Rules

Five rules. Each fires when a specific bottleneck is confirmed under load.

Rules evaluate on structurally valid windows (window_is_evaluable). Recommendations surface only when a signal is persistent across evaluable windows.

R1: Under-batching

| Condition | Threshold |
| --- | --- |
| Scheduler occupancy | <25% of max_num_seqs |
| No backlog | waiting < 2 |
| Prefill not saturated | Δprefill_sum / window_secs < 0.40 AND Δprefill_sum < 4.0s |

Prefill saturation uses histogram window mass (Δsum / window_secs), not the roofline ceiling. A window with more than 40% of its duration spent in prefill, or more than 4 absolute seconds, is not under-batched.

When the prefill gate suppresses R1, verbose mode (-v) shows why: Under-batching: not triggered (prefill saturated at 55%). Without -v, the rule appears as not triggered with no reason.

Fix: Raise client concurrency.
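
A sketch of the R1 conditions for a single window, including the prefill gate and the reason string that verbose mode reports. The function name and the other reason strings are hypothetical; only the prefill-saturated message format comes from the text above.

```python
from typing import Optional, Tuple

def r1_under_batching(running: float, waiting: float,
                      max_num_seqs: Optional[int],
                      prefill_sum_delta_secs: float,
                      window_secs: float) -> Tuple[bool, str]:
    """Return (fired, reason) for one window."""
    if max_num_seqs is None or max_num_seqs <= 0:
        return False, "max_num_seqs unknown"
    if window_secs <= 0:
        return False, "invalid window duration"
    if running / max_num_seqs >= 0.25:
        return False, "scheduler occupancy at or above 25%"
    if waiting >= 2:
        return False, "backlog present"
    prefill_fraction = prefill_sum_delta_secs / window_secs
    if prefill_fraction >= 0.40 or prefill_sum_delta_secs >= 4.0:
        return False, f"prefill saturated at {prefill_fraction:.0%}"
    return True, "under-batched"
```

For example, 2 running requests against max_num_seqs=16 with an empty queue fires the rule, unless prefill occupied 1.1 s of a 2 s window, in which case the gate suppresses it.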

R2: KV cache pressure

R2 has two sub-variants. Both surface under the same [!] KV Cache Pressure header. The main path suppresses the backlog path when both fire.

Main path

| Condition | Threshold |
| --- | --- |
| KV cache usage | ≥88% (or preemptions active) |

VRAM ≥78% is shown in output as corroborating evidence when present, but is not a required condition. R2 fires on KV usage or preemptions alone.

Fix: Reduce --max-num-seqs or --max-model-len.

Admission backlog path

| Condition | Threshold |
| --- | --- |
| Queue ratio | waiting / (running + waiting) ≥ 30% |
| Free KV tokens | < demand from queued requests (waiting × prompt_tokens_mean) |
| Concurrency cap | running < max_num_seqs (scheduler not at cap) |

Fires when the scheduler is holding requests in queue to protect KV memory, before KV usage crosses the 88% threshold. The fix is to expand the KV pool, not reduce concurrency.

Fix: Raise --gpu-memory-utilization if VRAM headroom exists; switch to fp8 KV cache; or reduce --max-model-len.

R2 evaluates against kv_cache_peak_perc, the maximum KV usage across all polls in the window, not the average. A window that averaged 70% but peaked at 92% will still fire. The spike is what causes preemptions. The average hides it.
R2 is the only rule that cross-references a live GPU reading (VRAM from NVML) with a live vLLM reading (KV cache usage). If gpu_observed_at and vllm_observed_at are more than 1 second apart, R2 is skipped. The other rules draw from a single source and don't have this constraint.
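
The two R2 paths and the timestamp guard can be sketched together. This is an illustration under assumptions: the function name and return labels are hypothetical, free_kv_tokens is taken as an input because its derivation is not described here, and KV usage is passed as a percentage.

```python
from typing import Optional

def r2_kv_pressure(*, kv_peak_pct: Optional[float],
                   preemptions_per_sec: Optional[float],
                   running: float, waiting: float,
                   max_num_seqs: Optional[int],
                   free_kv_tokens: Optional[float],
                   prompt_tokens_mean: Optional[float],
                   gpu_observed_at: float, vllm_observed_at: float) -> str:
    """Return 'skipped', 'main', 'admission_backlog', or 'not_triggered'."""
    # Cross-source guard: readings more than 1 s apart are not comparable.
    if abs(gpu_observed_at - vllm_observed_at) > 1.0:
        return "skipped"
    preempting = preemptions_per_sec is not None and preemptions_per_sec > 0
    if preempting or (kv_peak_pct is not None and kv_peak_pct >= 88.0):
        return "main"          # main path suppresses the backlog path
    total = running + waiting
    if (total > 0 and waiting / total >= 0.30
            and free_kv_tokens is not None and prompt_tokens_mean is not None
            and free_kv_tokens < waiting * prompt_tokens_mean
            and max_num_seqs is not None and running < max_num_seqs):
        return "admission_backlog"
    return "not_triggered"

base = dict(preemptions_per_sec=0.0, running=10, waiting=6, max_num_seqs=32,
            free_kv_tokens=1000.0, prompt_tokens_mean=500.0,
            gpu_observed_at=0.0, vllm_observed_at=0.2)
peak_case = r2_kv_pressure(kv_peak_pct=92.0, **base)      # peak crosses 88%
backlog_case = r2_kv_pressure(kv_peak_pct=70.0, **base)   # queue starved of KV
```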

R3: Low prefix reuse

| Condition | Threshold |
| --- | --- |
| Prefix cache hit rate | <35% |
| Mean prompt tokens | ≥20 |
| Request rate | ≥5 req/s |
| Running requests | >0.75 |

Fix: If prefix caching is disabled: enable --enable-prefix-caching, then restructure prompts. If already enabled: move shared instructions to the start, standardize templates, avoid unique tokens at the beginning.

R4: OOM risk

Fires regardless of traffic. Configuration fact, not runtime observation.

| Formula | What it computes |
| --- | --- |
| usable_vram = (vram_total_gb × gpu_mem_util) − 3.0 GB | VRAM available after activation and KV buffer reservation (3 GB constant) |
| min_tp = ceil(weight_gb / usable_vram) | Minimum tensor parallel degree to fit the model |

The 3 GB buffer (ACTIVATION_KV_BUFFER_GB) reserves headroom for activations and initial KV blocks. R4 fires when kv_headroom_gb < 0 (weight_gb / tp > per-GPU VRAM). min_tp uses gpu_memory_utilization (default 0.90) for the fix recommendation.

Fix: Profile computes required TP degree: "Set --tensor-parallel-size N".
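
A worked example of the R4 arithmetic for a 140 GB model (70B parameters at bf16) on 80 GB GPUs. Function names are illustrative; ACTIVATION_KV_BUFFER_GB and the 0.90 default come from the text above.

```python
import math
from typing import Optional

ACTIVATION_KV_BUFFER_GB = 3.0

def r4_fires(weight_gb: float, vram_total_gb: float, tp: int = 1) -> bool:
    """R4 fires when per-GPU KV headroom is negative."""
    return vram_total_gb - (weight_gb / tp) < 0

def min_tp(weight_gb: float, vram_total_gb: float,
           gpu_mem_util: float = 0.90) -> Optional[int]:
    """Minimum tensor-parallel degree for the fix recommendation."""
    usable_vram = vram_total_gb * gpu_mem_util - ACTIVATION_KV_BUFFER_GB
    if usable_vram <= 0:
        return None   # no TP degree can help on this GPU
    return math.ceil(weight_gb / usable_vram)

fires = r4_fires(140.0, 80.0, tp=1)   # 140 GB does not fit in 80 GB
needed = min_tp(140.0, 80.0)          # usable 69 GB per GPU, so ceil(140 / 69)
```

The recommendation is stricter than the firing condition: at TP=2 the weights fit (70 GB per GPU), but the 69 GB usable budget after the utilization cap and buffer pushes the recommended degree to 3.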

R5: Concurrency saturation

| Condition | Threshold |
| --- | --- |
| running ≈ max_num_seqs | Within 0.5 |
| Queue ratio | waiting / (running + waiting) ≥ 30% |
| Sustained | ≥3 windows, ≥25% of evaluable |
| KV cache headroom | KV peak <80% (falls back to avg when peak absent) |

Fix: Raise --max-num-seqs if KV peak <80%. If KV is at or above 80%, raising the cap will cause thrashing; add a replica or lower --max-model-len instead.

vLLM ≤0.18.0 doesn't expose the vllm:max_num_seqs gauge. Pass -m <value> or R5 won't fire.
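
The per-window R5 check can be sketched as below; the sustained condition is handled by the significance gate and is not part of this function. The name is illustrative, and treating an absent KV reading as non-blocking is an assumption, based on the confidence table allowing R5 to fire at lower confidence when KV data is missing.

```python
from typing import Optional

def r5_window_fires(running: float, waiting: float,
                    max_num_seqs: Optional[int],
                    kv_peak_pct: Optional[float],
                    kv_avg_pct: Optional[float]) -> bool:
    """True when the concurrency cap, not KV memory, is the constraint."""
    if max_num_seqs is None:
        return False                       # no cap known: pass -m
    if abs(running - max_num_seqs) > 0.5:
        return False                       # scheduler not at the cap
    total = running + waiting
    if total <= 0 or waiting / total < 0.30:
        return False                       # queue not building
    kv = kv_peak_pct if kv_peak_pct is not None else kv_avg_pct
    # Assumption: a missing KV reading does not block the rule.
    return kv is None or kv < 80.0
```

running is compared with a tolerance because it is a time-weighted mean over the window and is rarely an exact integer.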

Confidence

Per-rule, not a unified formula. Each rule sets confidence based on what signals it has.

| Rule | Confidence | Condition |
| --- | --- | --- |
| R1 | 0.80 | Fixed |
| R2 | 0.95 | Preemptions active |
| R2 | 0.85 | KV cache ≥ 95% |
| R2 | 0.70 | KV cache ≥ 88% (fire threshold), no preemptions |
| R2 | 0.85 | KV admission backlog variant |
| R3 | 0.90 | Fixed |
| R4 | 0.95 | dtype from DTYPE / VLLM_DTYPE env var |
| R4 | 0.90 | dtype from kv_cache_dtype or catalog |
| R4 | 0.60 | dtype from bf16 fallback |
| R5 | 0.90 | TTFT (mean or p99) and KV cache both present |
| R5 | 0.60 | One or both absent |

Rule significance gate

rule_is_significant = fired ≥ 3  AND  fired / total_evaluable_windows ≥ 25%

A rule that fired once is noise. Profile only surfaces a recommendation when the signal is persistent: at least 3 windows and at least 25% of all evaluable windows in the run.

R2 main path exception. Critical KV conditions bypass the standard gate early:

| Condition | Windows required |
| --- | --- |
| Any window with active preemptions | 1 (significant immediately) |
| KV ≥ 95% in ≥ 2 windows | 2 |
| Standard gate (KV 88–95%, no preemptions) | ≥ 3 AND ≥ 25% |

The admission backlog path and all other rules use only the standard gate.
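
The standard gate and the R2 main-path exception can be sketched as two functions. Names and the input shape (one tuple per window in which the R2 main path fired) are illustrative.

```python
from typing import Optional, Sequence, Tuple

def rule_is_significant(fired: int, total_evaluable_windows: int) -> bool:
    """Standard gate: at least 3 windows and at least 25% of evaluable windows."""
    if total_evaluable_windows <= 0:
        return False
    return fired >= 3 and fired / total_evaluable_windows >= 0.25

def r2_main_is_significant(
    fired_windows: Sequence[Tuple[Optional[float], bool]],
    total_evaluable_windows: int,
) -> bool:
    """fired_windows: (kv_peak_pct, preemptions_active) per fired window."""
    if any(preempting for _, preempting in fired_windows):
        return True                                   # significant immediately
    critical = sum(1 for kv, _ in fired_windows if kv is not None and kv >= 95.0)
    if critical >= 2:
        return True                                   # KV >= 95% twice
    return rule_is_significant(len(fired_windows), total_evaluable_windows)
```

With 12 evaluable windows, a rule that fired 3 times passes (exactly 25%); with 13 it does not.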

Ranking

score = impact × confidence  // impact: 1–5, confidence: 0.0–1.0

Rules sorted by score, highest first. Grouped by root cause into IssueGroup. One primary recommendation surfaced per group: the rest are supporting evidence, not separate alerts.
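
A minimal sketch of scoring and grouping. The Finding type, the root-cause labels, and the impact values are hypothetical and chosen only to illustrate the ordering; the confidences are taken from the confidence table.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Finding:
    rule: str
    root_cause: str      # hypothetical grouping key
    impact: int          # 1-5
    confidence: float    # 0.0-1.0

    @property
    def score(self) -> float:
        return self.impact * self.confidence

def primary_per_group(findings: List[Finding]) -> List[Tuple[Finding, List[Finding]]]:
    """Return (primary, supporting) per root cause, highest score first."""
    ranked = sorted(findings, key=lambda f: f.score, reverse=True)
    groups: Dict[str, List[Finding]] = {}
    for f in ranked:                      # dict keeps first-seen (best) order
        groups.setdefault(f.root_cause, []).append(f)
    return [(members[0], members[1:]) for members in groups.values()]

findings = [
    Finding("R1", "concurrency", impact=3, confidence=0.80),
    Finding("R4", "memory", impact=5, confidence=0.60),
    Finding("R2", "memory", impact=5, confidence=0.95),
]
result = primary_per_group(findings)
```

In this example R2 becomes the primary recommendation for the memory group and R4 is kept as supporting evidence, while R1 leads its own group.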

Data

What, when, and how we collect.

Cadence

GPU and vLLM scraped in parallel, 250ms apart. One window is ~2 seconds, ~9 polls. Each collector timestamps on completion. If the two timestamps diverge by more than 1 second, rules that cross-reference both sources are skipped.

Within each window, different metric types are treated differently:

| Type | Treatment | Edge cases |
| --- | --- | --- |
| Gauges | Last scrape in the window | kv_cache_peak_perc is max across all scrapes in the window, not last. KV cache can spike and recover within a window; last scrape alone misses it. Shown in output when peak > avg + 10pp or peak ≥ 95%. Same pattern for VRAM peak (≥90% of total) and GPU temp peak (≥80°C). |
| Histograms | Δsum/Δcount, first to last scrape | Falls back to last-scrape cumulative mean when Δcount ≤ 0 (no new completions in window). p99 has no fallback; stale p99 is worse than none. |
| Counters | Rates = Δ / window duration. Raw totals from last scrape. | None on counter reset (negative delta) or zero-duration window. Zero delta is valid: no activity, not missing data. |

Evaluable window gate

Two-tier gate. Structural validity and active traffic are separate checks.

Tier 1: structural. Did the endpoint respond and does the window have a valid duration? All rules use this check.

window_duration_secs  is present and positive
num_requests_running  is Some (even if zero)

running = 0 is valid data: the server responded and reported no active requests. None means the endpoint did not respond.

Tier 2: active. Was the server doing real work? Aggregated means use this check. Throughput, efficiency %, GPU utilization, and latency averages are computed over active windows only.

running_reqs > 0
AND (kv_cache_pct > 30%  OR  gpu_util_pct > 20%)

// both absent: running_reqs > 0 alone
// one absent:  evaluate the other arm only

kv_cache_pct catches decode and sustained load. gpu_util_pct catches prefill bursts where the KV cache is still near-empty. Idle windows pass tier 1 but not tier 2 and are excluded from the averages.
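
The two tiers can be sketched as separate predicates. Function names are illustrative; the handling of absent arms follows the comments in the tier 2 block above.

```python
from typing import Optional

def is_evaluable(window_duration_secs: Optional[float],
                 num_requests_running: Optional[float]) -> bool:
    """Tier 1: structural validity. running == 0 is valid data; None is not."""
    return (window_duration_secs is not None
            and window_duration_secs > 0
            and num_requests_running is not None)

def is_active(running_reqs: Optional[float],
              kv_cache_pct: Optional[float],
              gpu_util_pct: Optional[float]) -> bool:
    """Tier 2: was the server doing real work?"""
    if running_reqs is None or running_reqs <= 0:
        return False
    if kv_cache_pct is None and gpu_util_pct is None:
        return True                          # both absent: running alone decides
    kv_busy = kv_cache_pct is not None and kv_cache_pct > 30.0
    gpu_busy = gpu_util_pct is not None and gpu_util_pct > 20.0
    return kv_busy or gpu_busy               # an absent arm never passes
```

An idle window such as is_evaluable(2.0, 0) passes tier 1 but fails is_active(0, ...), so it counts toward rule evaluation totals without diluting the averages.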

Multi-window aggregation, including which window set applies to which field, is covered in Math.

Source 1: NVML

NVIDIA Management Library. Direct access to GPU hardware state, polled at 250ms intervals.

| Field | Type | What it is | How collected |
| --- | --- | --- | --- |
| gpu_util_pct | Gauge | Fraction of time any kernel was executing. Not SM occupancy. | Mean across window polls |
| mem_util_pct | Gauge | Fraction of time the memory controller was busy | Mean across window polls |
| power_watts | Gauge | Power draw (W) | Mean across window polls |
| power_limit_watts | Gauge | Driver-set TDP limit (W) | Once at startup |
| vram_used_mb | Gauge | VRAM in use (MiB) | Last poll |
| vram_total_mb | Gauge | Total device VRAM (MiB) | Last poll |
| temperature_c | Gauge | GPU core temp (°C) | Last poll |
| sm_clock_mhz | Gauge | SM clock frequency (MHz) | Last poll |

Two values are computed from the polls, not read directly from NVML: vram_peak_mb (max of vram_used_mb across all polls) and temperature_peak_c (max of temperature_c across all polls). Derived within the window, not collected.
NVML does not expose theoretical peak FLOPs or memory bandwidth. Those come from the GPU catalog, looked up once at startup from gpu_name, labeled (est) in output.

Source 2: vLLM Prometheus

Scraped from /metrics at ~250ms intervals, ~9 times per window. See vLLM docs for the full metric reference.

| Field | Type | What it is |
| --- | --- | --- |
| vllm:num_requests_running | Gauge | Active requests in flight |
| vllm:num_requests_waiting | Gauge | Requests queued, not yet scheduled |
| vllm:num_requests_swapped | Gauge | Requests with KV blocks evicted to CPU |
| vllm:kv_cache_usage_perc / vllm:gpu_cache_usage_perc | Gauge | KV cache fill ratio, 0–1. Primary name is kv_cache_usage_perc; falls back to gpu_cache_usage_perc on older vLLM. |
| vllm:cpu_cache_usage_perc | Gauge | CPU KV cache fill ratio |
| vllm:time_to_first_token_seconds | Histogram | Time from request arrival to first output token. TTFT mean + p99. |
| vllm:request_time_per_output_token_seconds | Histogram | Per-token decode latency. TPOT mean + p99. |
| vllm:time_per_output_token_seconds | Histogram | TPOT, older metric name. Fallback when primary is absent. |
| vllm:request_prefill_time_seconds | Histogram | Prefill latency per request |
| vllm:request_queue_time_seconds | Histogram | Time spent waiting before scheduling |
| vllm:request_prompt_tokens | Histogram | Prompt token count per request |
| vllm:generation_tokens_total / vllm:iteration_tokens_total_sum | Counter | Cumulative tokens generated. Primary is generation_tokens_total; falls back to iteration_tokens_total_sum on older vLLM. |
| vllm:request_success_total | Counter | Cumulative completed requests. Legacy name: vllm:request_success. |
| vllm:num_preemptions_total | Counter | Cumulative KV cache preemptions. Legacy name: vllm:num_preemptions. |
| vllm:cache_config_info | Gauge (labels) | Static cache config: block_size, cache_dtype, prefix_caching, chunked_prefill |
| vllm:max_num_seqs | Gauge | Concurrency cap. Absent in vLLM ≤0.18.0; pass -m. |
| vllm:prefix_cache_hits_total, vllm:prefix_cache_queries_total, vllm:external_prefix_cache_hits_total, vllm:external_prefix_cache_queries_total | Counter | Internal and external prefix cache hits and queries. Legacy names without _total also supported. |

Catalog-derived fields

Looked up once at startup. See Catalog for supported GPUs and models.

| Field | Source | If absent |
| --- | --- | --- |
| peak_flops_f32_tflops, peak_bw_gbps | GPU catalog. Non-tensor-core FP32, conservative roofline input. | Ceilings not computed |
| param_count, active_param_count | Model catalog. MoE: total for weight/OOM; active for roofline. | Roofline skipped |
| bytes_per_param | DTYPE / VLLM_DTYPE env → kv_cache_dtype → catalog default → bf16 (2 bytes) | bf16 fallback, labeled in output |

All catalog-derived numbers are labeled (est) in output. Upper-bound approximations, not measured values.

Fallbacks

| Field | Primary | Fallback | If absent |
| --- | --- | --- | --- |
| max_num_seqs | vllm:max_num_seqs gauge | -m CLI flag | R1 and R5 silent |
| tensor_parallel_size | --tensor-parallel-size CLI flag | TENSOR_PARALLEL_SIZE / VLLM_TENSOR_PARALLEL_SIZE env var | Treated as 1; ceilings and kv_headroom unscaled |
| TPOT histogram | request_time_per_output_token | time_per_output_token | None |
| bytes_per_param | DTYPE / VLLM_DTYPE env var | kv_cache_dtype → catalog → bf16 | bf16, labeled in output |
| All windows non-evaluable | | Chronologically last raw window | Labeled as non-evaluable |

Limitations (WIP)

Where the math is approximate and why.

| Gap | Affects | In practice |
| --- | --- | --- |
| MoE weight uses catalog total param_count | R4, baseline | Runtime VRAM may be lower if not all experts are loaded |
| TP degree not reported by vLLM metrics; must be set via flag or env var | Baseline, R4 | If --tensor-parallel-size / TENSOR_PARALLEL_SIZE absent, Profile assumes TP=1 and underestimates ceilings on multi-GPU deployments |
| Efficiency % uses theoretical BW ceiling, not measured HBM | All rules | Can't distinguish BW bottleneck from scheduler under-feeding |
| FLOPs/s not measurable via NVML | R1 | Prefill-heavy workloads show artificially low efficiency % |
| vllm:max_num_seqs absent in vLLM ≤0.18.0 | R5 | Requires -m flag |
| Chunked prefill: running count can exceed max_num_seqs | R5 | R5 skips. Cap is not the constraint |
| GPU and model specs from catalog, not measured | Baseline | Output labels ceilings (est) |
| p99 absent on counter reset or zero-traffic window | Display | By design. Stale p99 is worse than none. |

Design (WIP)

The non-obvious choices and the reasoning behind them.

Option<T> over sentinels

Some(0.0) ≠ None. Every metric is optional. Missing data and zero are different. Treating them the same produces confident wrong answers.

Physics ceiling as a range

CeilingEstimate { lower, expected, upper }. Catalog-derived specs carry uncertainty. Output labels ceilings (est).

One recommendation per iteration

Rules grouped by root cause. One primary signal per group. Five simultaneous recommendations produce paralysis, not action.

Delta, not snapshot

Each iteration reports what changed since the last: direction, magnitude, likely cause. A snapshot without temporal comparison is a status report, not a diagnosis.

Shared traffic gate

window_is_evaluable is one function, called once, used by all rules. Idle state has no waste to measure.

Engine never imports CLI

The reasoning layer is deterministic and testable in isolation. Adding a new interface (SGLang, API) doesn't touch the engine.

250ms collection interval

Fast enough to catch GPU utilization transients that a 1s or 5s poll would average away. Slow enough that scrapes per window add no meaningful load to the vLLM endpoint or NVML.

Peak over average for KV cache

KV cache usage is tracked as the maximum across all scrapes in a window, not the last value. A spike that causes preemptions and latency degradation can recover before the window ends. The last scrape reports a safe number while the damage already happened. The peak is the signal. The average buries it.

Zero allocation in the hot loop

AnalysisInput<'a> borrows both contexts. No clones per 250ms window. A profiler that eats its target's resources is worse than no profiler.

Security

rustls only, no OpenSSL. Read-only NVML. No external API calls. No telemetry. cargo audit clean before every release tag.

The Inference Problem

TL;DR

  • Inference is 80–90% of your AI system's lifetime cost, not training.
  • Production servers bleed compute across batching gaps, KV cache pressure, prefill overhead, low throughput, and high latency.
  • The problems are fixable. Most teams can't see them.

You send a prompt. A GPU somewhere processes it and sends back a response, one token at a time. That loop is inference. It is also where most of the money goes.

[Figure: The AI inference cycle.]
80–90%: share of a production AI system's lifetime cost that goes to inference, not training [1]
15×: GPT-4 inference cost relative to training cost ($2.3B in inference by end of 2024 vs ~$150M to train)

Training gets the headlines. Inference pays the bills.

An H100 rents for roughly $3/hr, or about $2,190 for a 730-hour month. When your GPU is underutilized, you get fewer tokens than the hardware can deliver. Same bill, lower output. At half the achievable throughput, that is over $1,000/month in cloud compute per GPU, wasted.

Most teams are bad at it. The waste is in:

Throughput. A misconfigured vLLM server runs well below its hardware ceiling. When batching, memory, and scheduling are off, GPU utilization collapses. The hardware is not the problem. The configuration is.

Latency. Real-time applications need time-to-first-token under 100ms. Most production setups don't get there without deliberate tuning.

KV cache pressure. Every token generated needs to remember everything before it. When that memory fills up, the system evicts. Throughput drops. Latency spikes. At 90% usage, you're already in trouble.

Batching. When batch size is 2 out of a possible 16, you're using 12.5% of the machine. The rest idles. You pay full price.

Prefill overhead. Before the first output token, the model processes your entire prompt. In workloads with 500+ token prompts, prefill alone dominates response time.

Observability. Most teams look at raw Prometheus metrics and guess. There is no standard path from "throughput is low" to "here is exactly why, here is the fix."

These are not edge cases. This is the normal state of a production inference server that has not been tuned.

Why This Matters Now

The standard story about early tech: Amazon lost money for years. Uber still does. Growth first, economics later. AI inference does not fit that story.

Amazon chose to lose money, a deliberate bet on demand. Companies running AI inference are burning money on hardware they already paid for, because it is sitting misconfigured. That is not a growth strategy. It is waste that shows up on your cloud compute bill this month.

Enterprises running production AI spend $50,000 to $500,000 per month on inference infrastructure. [2] Inference costs have dropped 280× since 2022, from $20 to $0.07 per million tokens. Even so, the gap between what hardware can do and what most teams extract from it is your cost problem today.

"The AI inferencing market will be much, much larger than the AI training market. People are running out of usable inference computing capacity." — Larry Ellison, Oracle earnings call [3]

The shortage is not GPUs. It is efficient use of the GPUs already running in production, including yours.