Profile

Physics-aware optimization loop for vLLM inference servers.

What it does

Computes the hardware ceiling for your model + GPU. Shows how far below it you are.
Detects the bottleneck: batching gap, KV pressure, concurrency cap, prefix cache miss.
Measures before and after. Every recommendation is accountable.

Install

# Linux x86_64
wget https://github.com/jungledesh/profile/releases/latest/download/profile
chmod +x profile
./profile diagnose --url http://localhost:8000/metrics

# Rust toolchain
cargo install profile-vllm

Usage

profile diagnose --url http://HOST:8000/metrics
profile diagnose --url http://HOST:8000/metrics --duration 2m
profile diagnose --url http://HOST:8000/metrics -m 256

What profile detects

Rule	Symptom	Fix
R1 Under-batching	GPU fed <25% of capacity	Raise concurrency or fix client
R2 KV cache pressure	KV usage ≥88% or preemptions active	Reduce max_num_seqs or add memory
R3 Low prefix reuse	Hit rate <35%	Enable prefix caching
R4 OOM risk	Weights exceed VRAM	Add tensor parallelism
R5 Concurrency saturation	max_num_seqs hit, queue building	Raise --max-num-seqs

Catalog

GPU specs, model parameters, and cloud prices resolved once at startup.

GPUs

FP32 TFLOPs and memory bandwidth feed the roofline math. Prices are market-average estimates as of 2026-06. Use --cost-per-hour for exact rates.

GPU	Arch	FP32 TFLOPs	Mem BW GB/s	On-demand $/hr	Spot $/hr
H100 PCIe	Hopper	60.0	2,000	$2.20	$0.90
H100 SXM	Hopper	67.0	3,350	$3.45	$1.85
H200	Hopper	67.0	4,800	$3.50	$2.00
A100 80GB	Ampere	19.5	2,039	$1.50	$0.60
A100 40GB	Ampere	19.5	1,555	$1.50	$0.60
A10G	Ampere	—	—	$1.00	$0.40
L40S	Ada	91.6	864	$1.80	$0.80
RTX 4090	Ada	82.6	1,008	—	—
RTX Pro 6000 Blackwell	Blackwell	125.0	960	—	—
B200	Blackwell	20.0	8,000	$6.00	$3.50
GB200	Blackwell	—	—	$8.50	$4.00
GB10 (DGX Spark)	Blackwell	20.0	273	—	—

FP32 TFLOPs are non-tensor-core throughput, the conservative roofline input. All values are labeled (est) in output. A10G and GB200 have price entries only; roofline is skipped for those GPUs. A plain "A100" without a size token (80GB or 40GB) in the NVML name will not match; Profile requires the size to disambiguate. Prices are sampled across AWS, GCP, Azure, Lambda, Nebius, RunPod, CoreWeave, and GMI Cloud.

Models

Matched by token against the model name vLLM reports at startup.

Model	Total params	Active params	MoE
Llama 4 Maverick	400B	17B	Yes
Llama 4 Scout	109B	17B	Yes
Llama 3 405B	405B	—	No
Llama 3 70B	70B	—	No
Llama 3 8B	8B	—	No
Nemotron 70B	70B	—	No
Nemotron 8B	8B	—	No
Qwen3 235B	235B	22B	Yes
Qwen3 72B	72B	—	No
Qwen3 32B	32B	—	No
Qwen3 30B	30B	3B	Yes
Qwen3 14B	14B	—	No
Qwen3 7B	7B	—	No
Qwen2.5 72B	72B	—	No
Qwen2.5 32B	32B	—	No
Qwen2.5 14B	14B	—	No
Qwen2.5 7B	7B	—	No
DeepSeek 671B / R1 / V3	671B	37B	Yes
DeepSeek 70B	70B	—	No
DeepSeek 7B	7B	—	No
Mistral Large 675B	675B	52B	Yes
Mistral Large 123B	123B	—	No
Mixtral 8x22B	141B	39B	Yes
Mixtral 8x7B	47B	13B	Yes
Mistral 7B	7B	—	No
Gemma 27B	27B	—	No
Gemma 9B	9B	—	No
Kimi K2	1,000B	32B	Yes
GLM 744B	744B	56B	Yes
GLM 32B	32B	—	No
Phi 4 32B	32B	—	No
Phi 4 14B	14B	—	No

MoE models: active_param_count drives roofline ceilings; param_count (total) drives weight_gb and OOM headroom. Unrecognized models skip roofline entirely. To add a model or GPU, open an issue or submit a PR to the catalog files in src/context/.

Math

The formulas Profile runs on collected data.

Profile measures behavior under load. Idle time has no waste to cut. Ceilings are catalog-derived upper bounds, labeled (est). Missing data stays absent (None), never guessed.

Roofline: hardware ceilings

Baseline inputs

Input	Resolution
`roofline_params`	`active_param_count` if present, else `param_count`. Feeds the decode and prefill ceilings, efficiency %, and ridge batch size.
`weight_params`	`param_count` if present, else `active_param_count`. Feeds `weight_gb` and `kv_headroom_gb` (OOM check).
`bytes_per_param`	`DTYPE` / `VLLM_DTYPE` env → `kv_cache_dtype` → catalog default → bf16 (2 bytes). fp8 = 1, fp16/bf16 = 2, fp32 = 4. Source labeled in output when fallback is used.
`tensor_parallel_size`	`--tensor-parallel-size` CLI flag → `TENSOR_PARALLEL_SIZE` / `VLLM_TENSOR_PARALLEL_SIZE` env var → 1. Scales both ceilings and per-GPU KV headroom.
`seq_len` (prefill only)	`prompt_tokens_mean` rounded → `max_model_len` → absent: prefill ceiling skipped.
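
The `bytes_per_param` fallback chain in the table can be sketched as below. This is a minimal illustration, not Profile's implementation: `DtypeSource`, `bytes_for`, and `resolve_bytes_per_param` are hypothetical names, and the accepted dtype spellings are an assumption. The env value is passed in as a parameter so the function stays pure.

```rust
/// Where the dtype came from, so output can label a fallback (hypothetical enum).
#[derive(Debug, PartialEq, Clone, Copy)]
enum DtypeSource {
    Env,
    KvCacheDtype,
    Catalog,
    Bf16Fallback,
}

/// Bytes per parameter for a dtype string: fp8 = 1, fp16/bf16 = 2, fp32 = 4.
/// The accepted spellings are an assumption; unknown strings such as "auto"
/// return None so the next source in the chain is tried.
fn bytes_for(dtype: &str) -> Option<f64> {
    match dtype.trim().to_ascii_lowercase().as_str() {
        "fp8" => Some(1.0),
        "fp16" | "float16" | "half" | "bf16" | "bfloat16" => Some(2.0),
        "fp32" | "float32" => Some(4.0),
        _ => None,
    }
}

/// Walks env -> kv_cache_dtype -> catalog default -> bf16 (2 bytes).
fn resolve_bytes_per_param(
    env: Option<&str>,
    kv_cache_dtype: Option<&str>,
    catalog_default: Option<&str>,
) -> (f64, DtypeSource) {
    let candidates = [
        (env, DtypeSource::Env),
        (kv_cache_dtype, DtypeSource::KvCacheDtype),
        (catalog_default, DtypeSource::Catalog),
    ];
    for (value, source) in candidates {
        if let Some(bytes) = value.and_then(bytes_for) {
            return (bytes, source);
        }
    }
    (2.0, DtypeSource::Bf16Fallback)
}
```

Returning the source next to the value lets the caller label fallback-derived numbers instead of presenting them as configured facts.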

Two ceilings, two binding constraints. LLM serving is memory-bandwidth-bound at decode and compute-bound at prefill.

Formula	Result	Binding constraint
`(peak_bw_gbps × tp) × 1e9 / (params × bytes_per_param)`	Decode ceiling (tok/s)	Memory bandwidth, binding in steady-state serving. `tp` GPUs contribute `tp×` aggregate bandwidth.
`(peak_flops × tp) × 1e12 / (6 × params × seq_len)`	Prefill ceiling (tok/s)	Compute, binding during prompt processing. `tp` GPUs contribute `tp×` aggregate FLOPs.
`expected × 0.85` / `expected × 1.05`	Ceiling range (lower / upper)	−15% / +5% band around expected. Reflects catalog approximation, not measured hardware.
`(peak_flops × 1e12 × bytes_per_param) / (peak_bw × 1e9)`	Ridge batch size	Concurrent batch at which decode crosses from BW-bound to compute-bound. Below: BW limits throughput. At or above: compute limits throughput. Prefill floor display suppressed when `num_running ≥ ridge_batch_size`.

peak_bw_gbps, peak_flops, and params come from catalogs, not live measurement. All ceiling outputs labeled (est).
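
A minimal sketch of the four formulas above, written exactly as the table states them. The `RooflineInput` struct and function names are hypothetical; only the arithmetic comes from the table.

```rust
/// Catalog-derived inputs. Field names follow the formulas above;
/// the struct itself is a hypothetical container.
struct RooflineInput {
    peak_bw_gbps: f64,    // memory bandwidth per GPU, GB/s
    peak_flops: f64,      // FP32 TFLOPs per GPU
    params: f64,          // roofline_params (active params for MoE)
    bytes_per_param: f64, // 1.0 = fp8, 2.0 = fp16/bf16, 4.0 = fp32
    tp: f64,              // tensor_parallel_size
}

/// Decode ceiling: aggregate bandwidth over bytes of weights read per token.
fn decode_ceiling_tps(i: &RooflineInput) -> f64 {
    (i.peak_bw_gbps * i.tp) * 1e9 / (i.params * i.bytes_per_param)
}

/// Prefill ceiling as documented: aggregate FLOPs over 6 x params x seq_len.
/// None when seq_len is absent or non-positive (ceiling skipped).
fn prefill_ceiling_tps(i: &RooflineInput, seq_len: Option<f64>) -> Option<f64> {
    let s = seq_len.filter(|s| *s > 0.0)?;
    Some((i.peak_flops * i.tp) * 1e12 / (6.0 * i.params * s))
}

/// (lower, expected, upper): the -15% / +5% band around the expected ceiling.
fn ceiling_range(expected: f64) -> (f64, f64, f64) {
    (expected * 0.85, expected, expected * 1.05)
}

/// Ridge batch size: where decode crosses from BW-bound to compute-bound.
fn ridge_batch_size(i: &RooflineInput) -> f64 {
    (i.peak_flops * 1e12 * i.bytes_per_param) / (i.peak_bw_gbps * 1e9)
}
```

With the catalog values for H100 SXM (67.0 TFLOPs, 3,350 GB/s) and Llama 3 70B in bf16 at TP=1, the decode ceiling is 3,350e9 / 140e9 ≈ 23.9 tok/s per request and the ridge batch size is 40.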

Efficiency %

aggregate_ceiling = decode_ceiling_tps × num_requests_running
efficiency_pct    = actual_tps / aggregate_ceiling × 100

decode_ceiling_tps is TP-scaled (peak_bw × tp from the roofline above). Requires num_requests_running > 0 and generation_tokens_per_sec > 0. Not shown if actual_tps > aggregate_ceiling: that indicates a catalog or measurement mismatch.

Calibrated for decode-bound workloads. Prefill-heavy workloads (long prompts, short outputs) show artificially low efficiency %. The prefill ceiling is the relevant constraint there, not decode.
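
The guards described above map naturally onto `Option`. A minimal sketch, assuming a hypothetical function name and that a missing or non-positive ceiling also yields no value:

```rust
/// Efficiency % against the aggregate decode ceiling.
/// None when any input is missing or non-positive, or when the measured
/// rate exceeds the ceiling (a catalog or measurement mismatch).
fn efficiency_pct(
    actual_tps: Option<f64>,
    decode_ceiling_tps: Option<f64>,
    num_requests_running: Option<f64>,
) -> Option<f64> {
    let actual = actual_tps.filter(|v| *v > 0.0)?;
    let running = num_requests_running.filter(|v| *v > 0.0)?;
    let ceiling = decode_ceiling_tps.filter(|v| *v > 0.0)?;

    let aggregate_ceiling = ceiling * running;
    if actual > aggregate_ceiling {
        return None; // not shown rather than clamped to 100
    }
    Some(actual / aggregate_ceiling * 100.0)
}
```

For example, 120 tok/s measured with 10 running requests against a 24 tok/s per-request ceiling gives 120 / 240 = 50%.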

Weight footprint and headroom

weight_gb        = weight_params × bytes_per_param / 1e9
kv_headroom_gb   = vram_total_gb − (weight_gb / tp)
headroom_pct     = 100 − min(efficiency_pct, 100)
tpot_floor_ms    = 1000 / decode_ceiling_tps
prefill_floor_ms = 1000 / prefill_ceiling_tps

kv_headroom_gb: per-GPU; each GPU holds weight_gb / N with TP=N. Negative means the current TP is insufficient.
headroom_pct: unused fraction of the decode ceiling.
tpot_floor_ms and prefill_floor_ms: theoretical minimum latencies at ceiling; actual values above these indicate overhead.
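
A worked sketch of the footprint formulas, with hypothetical function names. It assumes decimal GB throughout and treats a TP of 0 as 1.

```rust
/// Total weight footprint in GB (uses total params for MoE models).
fn weight_gb(weight_params: f64, bytes_per_param: f64) -> f64 {
    weight_params * bytes_per_param / 1e9
}

/// Per-GPU VRAM left after that GPU's weight shard. Negative: TP too small.
fn kv_headroom_gb(vram_total_gb: f64, weight_gb: f64, tp: u32) -> f64 {
    vram_total_gb - weight_gb / f64::from(tp.max(1))
}

/// Unused fraction of the decode ceiling, in percent.
fn headroom_pct(efficiency_pct: f64) -> f64 {
    100.0 - efficiency_pct.min(100.0)
}

/// Latency floor in ms for a ceiling in tok/s; None for a non-positive ceiling.
fn floor_ms(ceiling_tps: f64) -> Option<f64> {
    if ceiling_tps > 0.0 {
        Some(1000.0 / ceiling_tps)
    } else {
        None
    }
}
```

A 70B model in bf16 weighs 140 GB. On an 80 GB GPU that is −60 GB of headroom at TP=1 and +10 GB per GPU at TP=2.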

Economics

Metric	Formula
tok / W	`generation_tokens_per_sec / power_watts`
J / token	`power_watts / generation_tokens_per_sec`
$ / 1M tokens	`(cost_per_hr / tokens_per_hr) × 1e6`
Recoverable $/hr	`cost_per_hr × (1 − efficiency_pct / 100)`

tokens_per_hr = generation_tokens_per_sec × 3600. You pay cost_per_hr for that many tokens; scale to 1M.

Cost priority: --cost-per-hour flag, then GPU catalog on-demand price, then absent. Catalog-derived $/1M tok is labeled (est) in output; user-provided rates are not.
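
The cost formulas and the cost priority order can be sketched as follows. Function names are hypothetical; the boolean in `resolve_cost` marks a catalog-derived rate that would carry the (est) label.

```rust
/// $ per 1M tokens: hourly cost spread over tokens generated in that hour.
fn cost_per_million_tokens(cost_per_hr: f64, generation_tokens_per_sec: f64) -> Option<f64> {
    if generation_tokens_per_sec <= 0.0 {
        return None;
    }
    let tokens_per_hr = generation_tokens_per_sec * 3600.0;
    Some(cost_per_hr / tokens_per_hr * 1e6)
}

/// Recoverable $/hr: the share of the bill not converted into throughput.
fn recoverable_per_hr(cost_per_hr: f64, efficiency_pct: f64) -> f64 {
    cost_per_hr * (1.0 - efficiency_pct / 100.0)
}

/// Cost priority: user flag first, then catalog on-demand price, then absent.
/// Returns (rate, is_estimate).
fn resolve_cost(flag: Option<f64>, catalog_on_demand: Option<f64>) -> Option<(f64, bool)> {
    flag.map(|c| (c, false))
        .or(catalog_on_demand.map(|c| (c, true)))
}
```

At the catalog rate of $3.45/hr and 1,000 tok/s, cost is 3.45 / 3.6M × 1M ≈ $0.96 per 1M tokens. At 40% efficiency, about $2.07/hr is recoverable.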

Closed-loop delta

throughput_delta_pct = (after − before) / before × 100
efficiency_delta_pp  = efficiency_pct_after − efficiency_pct_before

Computed after every re-measure cycle. Direction uses efficiency_delta_pp, not raw throughput. Efficiency divides throughput by num_running (time-weighted mean across the window), which cancels traffic-induced concurrency changes; what remains is per-request hardware utilization.

Signal	Better	Worse	Plateau
`efficiency_delta_pp` (primary)	> +2.0 pp, and ttft_p99 not regressed > 20%	< −5.0 pp	between, or latency veto fired
`throughput_delta_pct` (fallback when efficiency unavailable)	> +10%	< −10%	between or both absent

Latency veto: if efficiency says Better but ttft_p99_delta_pct > +20%, direction is Plateau. Downgrades Better only; does not affect Plateau or Worse. Skipped on the throughput fallback path.

On Worse: loop pauses. Operator chooses [r] revert or [c] continue. On continue, the degraded state becomes the new baseline before the next iteration. On revert, baseline is unchanged.

All thresholds are provisional constants. Latency, cost, and p99 before/after are tracked alongside direction for display.
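
The direction table and latency veto, as a minimal sketch. The `Direction` enum and `classify` function are hypothetical names. One assumption is labeled in the code: a missing `ttft_p99_delta_pct` does not trigger the veto.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Direction {
    Better,
    Worse,
    Plateau,
}

/// Relative change in percent; None when the baseline is not positive.
fn throughput_delta_pct(before: f64, after: f64) -> Option<f64> {
    if before <= 0.0 {
        return None;
    }
    Some((after - before) / before * 100.0)
}

fn classify(
    efficiency_delta_pp: Option<f64>,
    throughput_delta_pct: Option<f64>,
    ttft_p99_delta_pct: Option<f64>,
) -> Direction {
    // Primary path: efficiency delta in percentage points.
    if let Some(pp) = efficiency_delta_pp {
        if pp > 2.0 {
            // Latency veto downgrades Better only.
            // Assumption: an absent p99 delta does not veto.
            let regressed = ttft_p99_delta_pct.map_or(false, |d| d > 20.0);
            return if regressed { Direction::Plateau } else { Direction::Better };
        }
        if pp < -5.0 {
            return Direction::Worse;
        }
        return Direction::Plateau;
    }
    // Fallback path: raw throughput, veto skipped.
    match throughput_delta_pct {
        Some(t) if t > 10.0 => Direction::Better,
        Some(t) if t < -10.0 => Direction::Worse,
        _ => Direction::Plateau,
    }
}
```

The asymmetric thresholds (+2.0 pp to call Better, −5.0 pp to call Worse) mean a small regression reads as Plateau rather than pausing the loop.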

Per-window collection (~2s)

Each window polls at ~250ms intervals. GPU (NVML) and vLLM are collected in parallel threads on each poll. These formulas produce one snapshot before multi-window aggregation.

Metric type	Formula	On failure
Histogram mean (TTFT, TPOT, prefill, queue, prompt tokens)	`Δsum / Δcount` first→last scrape; ×1000 for seconds-based latencies	Last-scrape cumulative mean when `Δcount ≤ 0`
Histogram p99	Delta buckets, then linear interpolation at q=0.99	`None`. No cumulative fallback (stale p99 is worse than none)
Counter rate (tok/s, req/s, preemptions/s)	`(last − first) / window_duration_secs`	`None` on negative delta or zero duration. Zero delta is valid idle, not missing
Prefix cache hit rate	`Δhits / Δqueries`	`None` when `Δqueries ≤ 0`
GPU util, power	Mean across window polls	Poll skipped if NVML field absent
VRAM, temp (current)	Last poll	Peaks (`vram_peak_mb`, `temperature_peak_c`, `kv_cache_peak_perc`): max across polls/scrapes
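
A sketch of the histogram-mean and counter-rate rows above, including their failure behavior. `HistScrape` and the function names are hypothetical; the inputs stand for the `_sum` and `_count` series of one Prometheus histogram at the first and last scrape of a window.

```rust
/// One scrape of a histogram's cumulative `_sum` and `_count`.
#[derive(Clone, Copy)]
struct HistScrape {
    sum: f64,
    count: f64,
}

/// Window mean: delta-sum over delta-count, first to last scrape.
/// Falls back to the last scrape's cumulative mean when delta-count <= 0.
fn window_mean(first: HistScrape, last: HistScrape) -> Option<f64> {
    let d_count = last.count - first.count;
    if d_count > 0.0 {
        return Some((last.sum - first.sum) / d_count);
    }
    if last.count > 0.0 {
        Some(last.sum / last.count)
    } else {
        None
    }
}

/// Seconds-based latencies are reported in milliseconds.
fn secs_to_ms(mean_secs: Option<f64>) -> Option<f64> {
    mean_secs.map(|s| s * 1000.0)
}

/// Counter rate. None on a negative delta (counter reset) or zero duration.
/// A zero delta is valid idle and yields Some(0.0).
fn counter_rate(first: f64, last: f64, window_duration_secs: f64) -> Option<f64> {
    let delta = last - first;
    if delta < 0.0 || window_duration_secs <= 0.0 {
        return None;
    }
    Some(delta / window_duration_secs)
}
```

Keeping `Some(0.0)` distinct from `None` is what lets an idle window read as "no activity" instead of "missing data".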

See Data for collection cadence and edge cases.

Aggregation: combining windows

Each ~2s window produces one snapshot. Fields aggregate by type across windows. Active: server under real load (requests running, KV or GPU util above threshold). Evaluable: endpoint responded with valid data.

Field type	Window set	Method
GPU util, power, running/waiting means, rates (tok/s, req/s)	Active	Time-weighted mean: `Σ(value × duration) / Σ(duration)`
Histogram means (TTFT, TPOT, prefill, queue latency)	Active	`ΣΔsum / ΣΔcount` across active windows. Weighted by observation count, not time.
p99 (TTFT, TPOT)	Active	Merge per-window delta bucket vectors (sum counts at matching boundaries), recompute q=0.99 via linear interpolation. Never average scalar p99 values.
KV cache avg, prefix cache hit rate, `prompt_tokens_mean`	Evaluable	KV avg: time-weighted mean. Prefix hit rate: `Σ Δhits / Σ Δqueries` across all evaluable windows.
KV / VRAM / temp peaks	Evaluable	`max(per-window peak, last evaluable landing value)` so aggregate peak ≥ displayed current.
State gauges (VRAM used, temp, sm_clock)	Evaluable (last)	Last evaluable window's landing value.
Cumulative counters (total tokens, total reqs)	All (chronological last)	Chronologically last collected window. Idle tail included. Preserves true Prometheus server totals.
All windows non-evaluable	—	Chronologically last raw window returned in full.
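
Two of the methods above, sketched under stated assumptions: the time-weighted mean, and the p99 merge. The quantile follows the usual Prometheus convention (cumulative buckets, lower bound of the first bucket taken as 0, the `+Inf` bucket resolving to the previous bound); whether Profile handles these edges identically is an assumption. All names are hypothetical.

```rust
/// (upper_bound, cumulative_count) pairs in ascending bound order,
/// already delta'd per window (last scrape minus first scrape).
type Buckets = Vec<(f64, f64)>;

/// Time-weighted mean over (value, duration_secs) samples.
fn time_weighted_mean(samples: &[(f64, f64)]) -> Option<f64> {
    let total: f64 = samples.iter().map(|s| s.1).sum();
    if total <= 0.0 {
        return None;
    }
    Some(samples.iter().map(|s| s.0 * s.1).sum::<f64>() / total)
}

/// Sum counts at matching boundaries. None if any layout differs.
fn merge_buckets(windows: &[Buckets]) -> Option<Buckets> {
    let (first, rest) = windows.split_first()?;
    let mut merged = first.clone();
    for w in rest {
        if w.len() != merged.len() {
            return None;
        }
        for (acc, b) in merged.iter_mut().zip(w.iter()) {
            if acc.0 != b.0 {
                return None;
            }
            acc.1 += b.1;
        }
    }
    Some(merged)
}

/// Quantile by linear interpolation inside the bucket that contains the rank.
fn quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let total = buckets.last()?.1;
    if total <= 0.0 {
        return None;
    }
    let rank = q * total;
    let (mut prev_bound, mut prev_count) = (0.0_f64, 0.0_f64);
    for &(bound, count) in buckets {
        if count >= rank {
            if bound.is_infinite() {
                return Some(prev_bound);
            }
            let in_bucket = count - prev_count;
            if in_bucket <= 0.0 {
                return Some(bound);
            }
            return Some(prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket);
        }
        prev_bound = bound;
        prev_count = count;
    }
    None
}
```

Merging before computing the quantile weights each window by its observation count. Averaging two scalar p99 values would give a quiet window the same say as a busy one.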

Rules

Five rules. Each fires when a specific bottleneck is confirmed under load.

Rules evaluate on structurally valid windows (window_is_evaluable). Recommendations surface only when a signal is persistent across evaluable windows.

R1: Under-batching

Condition	Threshold
Scheduler occupancy	<25% of max_num_seqs
No backlog	waiting < 2
Prefill not saturated	`Δprefill_sum / window_secs < 0.40` AND `Δprefill_sum < 4.0s`

Prefill saturation uses histogram window mass (Δsum / window_secs), not the roofline ceiling. A window that spends 40% or more of its duration in prefill, or accumulates 4 or more seconds of prefill time across requests, is not under-batched.

When the prefill gate suppresses R1, verbose mode (-v) shows why: Under-batching: not triggered (prefill saturated at 55%). Without -v, the rule appears as not triggered with no reason.

Fix: Raise client concurrency.
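
The three R1 conditions and the prefill gate can be sketched as one function. The struct, enum, and `eval_r1` are hypothetical names. Returning `Skipped` when `max_num_seqs` is unknown follows the Fallbacks table, which lists R1 as silent in that case.

```rust
struct R1Window {
    running: f64,
    waiting: f64,
    max_num_seqs: Option<f64>,
    prefill_sum_delta_secs: f64, // delta of the prefill histogram _sum
    window_secs: f64,
}

#[derive(Debug, PartialEq)]
enum R1Outcome {
    Fired,
    NotTriggered,
    PrefillSaturated { share_pct: f64 },
    Skipped,
}

fn eval_r1(w: &R1Window) -> R1Outcome {
    let cap = match w.max_num_seqs {
        Some(c) if c > 0.0 => c,
        _ => return R1Outcome::Skipped,
    };
    if w.window_secs <= 0.0 {
        return R1Outcome::Skipped;
    }
    let low_occupancy = w.running / cap < 0.25;
    let no_backlog = w.waiting < 2.0;
    if !(low_occupancy && no_backlog) {
        return R1Outcome::NotTriggered;
    }
    // Prefill gate: both limits must hold for R1 to fire.
    let share = w.prefill_sum_delta_secs / w.window_secs;
    if share >= 0.40 || w.prefill_sum_delta_secs >= 4.0 {
        return R1Outcome::PrefillSaturated { share_pct: share * 100.0 };
    }
    R1Outcome::Fired
}
```

Carrying the prefill share in the outcome is what would let verbose output print a reason such as "prefill saturated at 55%".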

R2: KV cache pressure

R2 has two sub-variants. Both surface under the same [!] KV Cache Pressure header. The main path suppresses the backlog path when both fire.

Main path

Condition	Threshold
KV cache usage	≥88% (or preemptions active)

VRAM ≥78% is shown in output as corroborating evidence when present, but is not a required condition. R2 fires on KV usage or preemptions alone.

Fix: Reduce --max-num-seqs or --max-model-len.

Admission backlog path

Condition	Threshold
Queue ratio	waiting / (running + waiting) ≥ 30%
Free KV tokens	< demand from queued requests (waiting × prompt_tokens_mean)
Concurrency cap	running < max_num_seqs (scheduler not at cap)

Fires when the scheduler is holding requests in queue to protect KV memory, before KV usage crosses the 88% threshold. The fix is to expand the KV pool, not reduce concurrency.

Fix: Raise --gpu-memory-utilization if VRAM headroom exists; switch to fp8 KV cache; or reduce --max-model-len.

R2 evaluates against kv_cache_peak_perc, the maximum KV usage across all polls in the window, not the average. A window that averaged 70% but peaked at 92% will still fire. The spike is what causes preemptions. The average hides it.

R2 is the only rule that cross-references a live GPU reading (VRAM from NVML) with a live vLLM reading (KV cache usage). If gpu_observed_at and vllm_observed_at are more than 1 second apart, R2 is skipped. The other rules draw from a single source and don't have this constraint.
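
Both R2 paths and the timestamp check, as a minimal sketch. All names are hypothetical. `free_kv_tokens` is taken as an input because the text does not say how it is derived, and timestamps are plain seconds for simplicity.

```rust
#[derive(Debug, PartialEq)]
enum R2Outcome {
    MainPath,
    BacklogPath,
    NotTriggered,
    Skipped,
}

struct R2Window {
    kv_cache_peak_perc: f64, // 0-100, max across polls in the window
    preemptions_per_sec: f64,
    running: f64,
    waiting: f64,
    max_num_seqs: f64,
    prompt_tokens_mean: f64,
    free_kv_tokens: f64,      // assumption: supplied by the caller
    gpu_observed_at_secs: f64,
    vllm_observed_at_secs: f64,
}

fn eval_r2(w: &R2Window) -> R2Outcome {
    // Cross-source rule: skip when the two readings are > 1 s apart.
    if (w.gpu_observed_at_secs - w.vllm_observed_at_secs).abs() > 1.0 {
        return R2Outcome::Skipped;
    }
    // Main path is checked first, so it suppresses the backlog path.
    if w.kv_cache_peak_perc >= 88.0 || w.preemptions_per_sec > 0.0 {
        return R2Outcome::MainPath;
    }
    let total = w.running + w.waiting;
    let queue_ratio = if total > 0.0 { w.waiting / total } else { 0.0 };
    let demand_tokens = w.waiting * w.prompt_tokens_mean;
    if queue_ratio >= 0.30 && w.free_kv_tokens < demand_tokens && w.running < w.max_num_seqs {
        return R2Outcome::BacklogPath;
    }
    R2Outcome::NotTriggered
}
```

A window that averaged 70% KV usage but peaked at 92% takes the main path here, because only the peak is passed in.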

R3: Low prefix reuse

Condition	Threshold
Prefix cache hit rate	<35%
Mean prompt tokens	≥20
Request rate	≥5 req/s
Running requests	>0.75

Fix: If prefix caching is disabled: enable --enable-prefix-caching, then restructure prompts. If already enabled: move shared instructions to the start, standardize templates, avoid unique tokens at the beginning.

R4: OOM risk

Fires regardless of traffic. Configuration fact, not runtime observation.

Formula	What it computes
`usable_vram = (vram_total_gb × gpu_mem_util) − 3.0 GB`	VRAM available after activation and KV buffer reservation (3 GB constant)
`min_tp = ceil(weight_gb / usable_vram)`	Minimum tensor parallel degree to fit the model

The 3 GB buffer (ACTIVATION_KV_BUFFER_GB) reserves headroom for activations and initial KV blocks. R4 fires when kv_headroom_gb < 0 (weight_gb / tp > per-GPU VRAM). min_tp uses gpu_memory_utilization (default 0.90) for the fix recommendation.

Fix: Profile computes required TP degree: "Set --tensor-parallel-size N".
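
A worked sketch of the two R4 computations. `ACTIVATION_KV_BUFFER_GB` is the constant named above; the function names are hypothetical.

```rust
const ACTIVATION_KV_BUFFER_GB: f64 = 3.0;

/// Minimum tensor-parallel degree that fits the weights.
/// None when the buffer leaves no usable VRAM at all.
fn min_tp(weight_gb: f64, vram_total_gb: f64, gpu_mem_util: f64) -> Option<u32> {
    let usable_vram = vram_total_gb * gpu_mem_util - ACTIVATION_KV_BUFFER_GB;
    if usable_vram <= 0.0 {
        return None;
    }
    Some((weight_gb / usable_vram).ceil() as u32)
}

/// Fire condition: per-GPU weight shard exceeds per-GPU VRAM.
fn r4_fires(weight_gb: f64, vram_total_gb: f64, tp: u32) -> bool {
    vram_total_gb - weight_gb / f64::from(tp.max(1)) < 0.0
}
```

For 140 GB of weights on 80 GB GPUs at the default 0.90 utilization, usable VRAM is 69 GB per GPU, so `min_tp` is ceil(140 / 69) = 3. The fire condition uses raw VRAM and already clears at TP=2 (70 GB per GPU), which shows that the recommendation is stricter than the trigger.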

R5: Concurrency saturation

Condition	Threshold
running ≈ max_num_seqs	Within 0.5
Queue ratio	waiting / (running + waiting) ≥ 30%
Sustained	≥3 windows, ≥25% of evaluable
KV cache headroom	KV peak <80% (falls back to avg when peak absent)

Fix: Raise --max-num-seqs if KV peak <80%. If KV is at or above 80%, raising the cap will cause thrashing; add a replica or lower --max-model-len instead.

vLLM ≤0.18.0 doesn't expose vllm:max_num_seqs. Pass -m <value>, or R1 and R5 won't fire.

Confidence

Per-rule, not a unified formula. Each rule sets confidence based on what signals it has.

Rule	Confidence	Condition
R1	0.80	Fixed
R2	0.95	Preemptions active
R2	0.85	KV cache ≥ 95%
R2	0.70	KV cache ≥ 88% (fire threshold), no preemptions
R2	0.85	KV admission backlog variant
R3	0.90	Fixed
R4	0.95	dtype from `DTYPE` / `VLLM_DTYPE` env var
R4	0.90	dtype from `kv_cache_dtype` or catalog
R4	0.60	dtype from bf16 fallback
R5	0.90	TTFT (mean or p99) and KV cache both present
R5	0.60	One or both absent

Rule significance gate

rule_is_significant = fired ≥ 3  AND  fired / total_evaluable_windows ≥ 25%

A rule that fired once is noise. Profile only surfaces a recommendation when the signal is persistent: at least 3 windows and at least 25% of all evaluable windows in the run.

R2 main path exception. Critical KV conditions bypass the standard gate early:

Condition	Windows required
Any window with active preemptions	1 (significant immediately)
KV ≥ 95% in ≥ 2 windows	2
Standard gate (KV 88–95%, no preemptions)	≥ 3 AND ≥ 25%

The admission backlog path and all other rules use only the standard gate.
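
The standard gate and the R2 main-path exception, sketched with hypothetical names. Window counts are assumed to be tallied elsewhere.

```rust
/// Standard gate: at least 3 windows and at least 25% of evaluable windows.
fn rule_is_significant(fired: u32, total_evaluable_windows: u32) -> bool {
    total_evaluable_windows > 0
        && fired >= 3
        && f64::from(fired) / f64::from(total_evaluable_windows) >= 0.25
}

/// Counts of critical KV windows observed in the run.
struct R2Evidence {
    windows_with_preemptions: u32,
    windows_kv_ge_95: u32,
}

/// R2 main path: critical conditions bypass the standard gate early.
fn r2_main_is_significant(fired: u32, total_evaluable_windows: u32, ev: &R2Evidence) -> bool {
    ev.windows_with_preemptions >= 1
        || ev.windows_kv_ge_95 >= 2
        || rule_is_significant(fired, total_evaluable_windows)
}
```

Three firings in 12 evaluable windows pass the standard gate (exactly 25%); three in 13 do not.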

Ranking

score = impact × confidence  // impact: 1–5, confidence: 0.0–1.0

Rules sorted by score, highest first. Grouped by root cause into IssueGroup. One primary recommendation surfaced per group; the rest are supporting evidence, not separate alerts.
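
Scoring and grouping, as a minimal sketch. `Finding`, `primaries`, the impact values, and the root-cause labels are all hypothetical; the real `IssueGroup` type is not shown in this document.

```rust
#[derive(Debug, Clone)]
struct Finding {
    rule: &'static str,
    root_cause: &'static str, // hypothetical grouping key
    impact: u8,               // 1-5
    confidence: f64,          // 0.0-1.0
}

impl Finding {
    fn score(&self) -> f64 {
        f64::from(self.impact) * self.confidence
    }
}

/// Sort by score, highest first, then keep the first finding per root cause
/// as that group's primary recommendation.
fn primaries(mut findings: Vec<Finding>) -> Vec<Finding> {
    findings.sort_by(|a, b| b.score().total_cmp(&a.score()));
    let mut seen: Vec<&'static str> = Vec::new();
    let mut out = Vec::new();
    for f in findings {
        if !seen.contains(&f.root_cause) {
            seen.push(f.root_cause);
            out.push(f);
        }
    }
    out
}
```

`f64::total_cmp` gives a total order, so the sort cannot panic on an unexpected NaN score.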

Data

What, when, and how we collect.

Cadence

GPU (NVML) and vLLM are scraped in parallel, one poll every ~250ms. One window is ~2 seconds, ~9 polls. Each collector timestamps on completion. If the two timestamps diverge by more than 1 second, rules that cross-reference both sources are skipped.

Within each window, different metric types are treated differently:

Type	Treatment	Edge cases
Gauges	Last scrape in the window	`kv_cache_peak_perc` is max across all scrapes in the window, not last. KV cache can spike and recover within a window; last scrape alone misses it. Shown in output when peak > avg + 10pp or peak ≥ 95%. Same pattern for VRAM peak (≥90% of total) and GPU temp peak (≥80°C).
Histograms	Δsum/Δcount, first to last scrape	Falls back to last-scrape cumulative mean when Δcount ≤ 0 (no new completions in window). p99 has no fallback; stale p99 is worse than none.
Counters	Rates = Δ / window duration. Raw totals from last scrape.	`None` on counter reset (negative delta) or zero-duration window. Zero delta is valid: no activity, not missing data.

Evaluable window gate

Two-tier gate. Structural validity and active traffic are separate checks.

Tier 1: structural. Did the endpoint respond and does the window have a valid duration? All rules use this check.

window_duration_secs  is present and positive
num_requests_running  is Some (even if zero)

running = 0 is valid data: the server responded and reported no active requests. None means the endpoint did not respond.

Tier 2: active. Was the server doing real work? Aggregated means use this check. Throughput, efficiency %, GPU utilization, and latency averages are computed over active windows only.

running_reqs > 0
AND (kv_cache_pct > 30%  OR  gpu_util_pct > 20%)

// both absent: running_reqs > 0 alone
// one absent:  evaluate the other arm only

kv_cache_pct catches decode and sustained load. gpu_util_pct catches prefill bursts where the KV cache is still near-empty. Idle windows pass tier 1 but not tier 2 and are excluded from the averages.
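
The two tiers can be sketched with `Option` fields. The `Window` struct and function names are hypothetical, and the sketch assumes tier 2 implies tier 1.

```rust
struct Window {
    window_duration_secs: Option<f64>,
    num_requests_running: Option<f64>,
    kv_cache_pct: Option<f64>,
    gpu_util_pct: Option<f64>,
}

/// Tier 1: the endpoint responded and the window has a valid duration.
/// running = Some(0.0) passes; None means no response.
fn is_evaluable(w: &Window) -> bool {
    w.window_duration_secs.map_or(false, |d| d > 0.0) && w.num_requests_running.is_some()
}

/// Tier 2: the server was doing real work.
fn is_active(w: &Window) -> bool {
    if !is_evaluable(w) {
        return false;
    }
    if !w.num_requests_running.map_or(false, |r| r > 0.0) {
        return false;
    }
    match (w.kv_cache_pct, w.gpu_util_pct) {
        (None, None) => true,                  // both absent: running alone decides
        (Some(kv), None) => kv > 30.0,         // one absent: evaluate the other arm
        (None, Some(gpu)) => gpu > 20.0,
        (Some(kv), Some(gpu)) => kv > 30.0 || gpu > 20.0,
    }
}
```

An idle window (`running = Some(0.0)`) is evaluable but not active, so it counts toward rule denominators without diluting throughput or latency averages.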

Multi-window aggregation (which window set applies to which field) is covered in Math.

Source 1: NVML

NVIDIA Management Library. Direct access to GPU hardware state, polled at 250ms intervals.

Field	Type	What it is	How collected
`gpu_util_pct`	Gauge	Fraction of time any kernel was executing. Not SM occupancy.	Mean across window polls
`mem_util_pct`	Gauge	Fraction of time the memory controller was busy	Mean across window polls
`power_watts`	Gauge	Power draw (W)	Mean across window polls
`power_limit_watts`	Gauge	Driver-set TDP limit (W)	Once at startup
`vram_used_mb`	Gauge	VRAM in use (MiB)	Last poll
`vram_total_mb`	Gauge	Total device VRAM (MiB)	Last poll
`temperature_c`	Gauge	GPU core temp (°C)	Last poll
`sm_clock_mhz`	Gauge	SM clock frequency (MHz)	Last poll

Two values are computed from the polls, not read directly from NVML: vram_peak_mb (max of vram_used_mb across all polls) and temperature_peak_c (max of temperature_c across all polls). Derived within the window, not collected.

NVML does not expose theoretical peak FLOPs or memory bandwidth. Those come from the GPU catalog, looked up once at startup from gpu_name, labeled (est) in output.

Source 2: vLLM Prometheus

Scraped from /metrics at ~250ms intervals, ~9 times per window. See vLLM docs for the full metric reference.

Field	Type	What it is
`vllm:num_requests_running`	Gauge	Active requests in flight
`vllm:num_requests_waiting`	Gauge	Requests queued, not yet scheduled
`vllm:num_requests_swapped`	Gauge	Requests with KV blocks evicted to CPU
`vllm:kv_cache_usage_perc` / `vllm:gpu_cache_usage_perc`	Gauge	KV cache fill ratio, 0–1. Primary name is `kv_cache_usage_perc`; falls back to `gpu_cache_usage_perc` on older vLLM.
`vllm:cpu_cache_usage_perc`	Gauge	CPU KV cache fill ratio
`vllm:time_to_first_token_seconds`	Histogram	Time from request arrival to first output token. TTFT mean + p99.
`vllm:request_time_per_output_token_seconds`	Histogram	Per-token decode latency. TPOT mean + p99.
`vllm:time_per_output_token_seconds`	Histogram	TPOT, older metric name. Fallback when primary is absent.
`vllm:request_prefill_time_seconds`	Histogram	Prefill latency per request
`vllm:request_queue_time_seconds`	Histogram	Time spent waiting before scheduling
`vllm:request_prompt_tokens`	Histogram	Prompt token count per request
`vllm:generation_tokens_total` / `vllm:iteration_tokens_total_sum`	Counter	Cumulative tokens generated. Primary is `generation_tokens_total`; falls back to `iteration_tokens_total_sum` on older vLLM.
`vllm:request_success_total`	Counter	Cumulative completed requests. Legacy name: `vllm:request_success`.
`vllm:num_preemptions_total`	Counter	Cumulative KV cache preemptions. Legacy name: `vllm:num_preemptions`.
`vllm:cache_config_info`	Gauge (labels)	Static cache config: block_size, cache_dtype, prefix_caching, chunked_prefill
`vllm:max_num_seqs`	Gauge	Concurrency cap. Absent in vLLM ≤0.18.0; pass `-m`.
`vllm:prefix_cache_hits_total`, `vllm:prefix_cache_queries_total`, `vllm:external_prefix_cache_hits_total`, `vllm:external_prefix_cache_queries_total`	Counter	Internal and external prefix cache hits and queries. Legacy names without `_total` also supported.

Catalog-derived fields

Looked up once at startup. See Catalog for supported GPUs and models.

Field	Source	If absent
`peak_flops_f32_tflops`, `peak_bw_gbps`	GPU catalog. Non-tensor-core FP32, conservative roofline input.	Ceilings not computed
`param_count`, `active_param_count`	Model catalog. MoE: total for weight/OOM; active for roofline.	Roofline skipped
`bytes_per_param`	`DTYPE` / `VLLM_DTYPE` env → `kv_cache_dtype` → catalog default → bf16 (2 bytes)	bf16 fallback, labeled in output

All catalog-derived numbers are labeled (est) in output. Upper-bound approximations, not measured values.

Fallbacks

Field	Primary	Fallback	If absent
`max_num_seqs`	`vllm:max_num_seqs` gauge	`-m` CLI flag	R1 and R5 silent
`tensor_parallel_size`	`--tensor-parallel-size` CLI flag	`TENSOR_PARALLEL_SIZE` / `VLLM_TENSOR_PARALLEL_SIZE` env var	Treated as 1; ceilings and kv_headroom unscaled
TPOT histogram	`vllm:request_time_per_output_token_seconds`	`vllm:time_per_output_token_seconds`	`None`
`bytes_per_param`	`DTYPE` / `VLLM_DTYPE` env var	`kv_cache_dtype` → catalog → bf16	bf16, labeled in output
All windows non-evaluable	—	Chronologically last raw window	Labeled as non-evaluable

Limitations (WIP)

Where the math is approximate and why.

Gap	Affects	In practice
MoE weight uses catalog total `param_count`	R4, baseline	Runtime VRAM may be lower if not all experts are loaded
TP degree not reported by vLLM metrics; must be set via flag or env var	Baseline, R4	If `--tensor-parallel-size` / `TENSOR_PARALLEL_SIZE` absent, Profile assumes TP=1 and underestimates ceilings on multi-GPU deployments
Efficiency % uses theoretical BW ceiling, not measured HBM	All rules	Can't distinguish BW bottleneck from scheduler under-feeding
FLOPs/s not measurable via NVML	R1	Prefill-heavy workloads show artificially low efficiency %
`vllm:max_num_seqs` absent in vLLM ≤0.18.0	R1, R5	Requires `-m` flag
Chunked prefill: running requests can exceed max_num_seqs	R5	R5 is skipped; the cap is not the constraint
GPU and model specs from catalog, not measured	Baseline	Output labels ceilings `(est)`
p99 absent on counter reset or zero-traffic window	Display	By design. Stale p99 is worse than none.

Design (WIP)

The non-obvious choices and the reasoning behind them.

Option<T> over sentinels

Some(0.0) ≠ None. Every metric is optional. Missing data and zero are different. Treating them the same produces confident wrong answers.

Physics ceiling as a range

CeilingEstimate { lower, expected, upper }. Catalog-derived specs carry uncertainty. Output labels ceilings (est).

One recommendation per iteration

Rules grouped by root cause. One primary signal per group. Five simultaneous recommendations produce paralysis, not action.

Delta, not snapshot

Each iteration reports what changed since the last: direction, magnitude, likely cause. A snapshot without temporal comparison is a status report, not a diagnosis.

Shared traffic gate

window_is_evaluable is one function, called once, used by all rules. Idle state has no waste to measure.

Engine never imports CLI

The reasoning layer is deterministic and testable in isolation. Adding a new interface (SGLang, API) doesn't touch the engine.

250ms collection interval

Fast enough to catch GPU utilization transients that a 1s or 5s poll would average away. Slow enough that scrapes per window add no meaningful load to the vLLM endpoint or NVML.

Peak over average for KV cache

KV cache usage is tracked as the maximum across all scrapes in a window, not the last value. A spike that causes preemptions and latency degradation can recover before the window ends. The last scrape reports a safe number while the damage already happened. The peak is the signal. The average buries it.

Zero allocation in the hot loop

AnalysisInput<'a> borrows both contexts. No clones per 250ms poll. A profiler that eats its target's resources is worse than no profiler.

Security

rustls only, no OpenSSL. Read-only NVML. No external API calls. No telemetry. cargo audit clean before every release tag.

The Inference Problem

Apr 19, 2026

TL;DR

Inference is 80–90% of your AI system's lifetime cost, not training.
Production servers bleed compute across batching gaps, KV cache pressure, prefill overhead, low throughput, and high latency.
The problems are fixable. Most teams can't see them.

You send a prompt. A GPU somewhere processes it and sends back a response, one token at a time. That loop is inference. It is also where most of the money goes.

Figure: The AI inference cycle.

80–90% of a production AI system's lifetime cost is inference, not training [1]

15×: GPT-4 inference cost versus training cost. $2.3B in inference by the end of 2024 versus ~$150M to train.

Training gets the headlines. Inference pays the bills.

An H100 rents for roughly $3/hr, or about $2,200 a month running around the clock. When your GPU is underutilized, you get fewer tokens than the hardware can deliver. Same bill, lower output. At half the hardware ceiling, that is over $1,000/month in cloud compute per GPU, wasted.

Most teams are bad at running inference efficiently. The waste is in:

Throughput. A misconfigured vLLM server runs well below its hardware ceiling. When batching, memory, and scheduling are off, GPU utilization collapses. The hardware is not the problem. The configuration is.

Latency. Real-time applications need time-to-first-token under 100ms. Most production setups don't get there without deliberate tuning.

KV cache pressure. Every token generated needs to remember everything before it. When that memory fills up, the system evicts. Throughput drops. Latency spikes. At 90% usage, you're already in trouble.

Batching. When batch size is 2 out of a possible 16, you're using 12.5% of the machine. The rest idles. You pay full price.

Prefill overhead. Before the first output token, the model processes your entire prompt. In workloads with 500+ token prompts, prefill alone dominates response time.

Observability. Most teams look at raw Prometheus metrics and guess. There is no standard path from "throughput is low" to "here is exactly why, here is the fix."

These are not edge cases. This is the normal state of a production inference server that has not been tuned.

Why This Matters Now

The standard story about early tech: Amazon lost money for years. Uber still does. Growth first, economics later. AI inference does not fit that story.

Amazon chose to lose money, a deliberate bet on demand. Companies running AI inference are burning money on hardware they already paid for, because it is sitting misconfigured. That is not a growth strategy. It is waste that shows up on your cloud compute bill this month.

Enterprises running production AI spend $50,000 to $500,000 per month on inference infrastructure. [2] Inference prices have dropped roughly 280× since 2022, from $20 to $0.07 per million tokens. Cheaper tokens do not fix a misconfigured server: the gap between what hardware can do and what most teams extract from it is your cost problem today.

"The AI inferencing market will be much, much larger than the AI training market. People are running out of usable inference computing capacity." — Larry Ellison, Oracle earnings call [3]

The shortage is not GPUs. It is efficient use of the GPUs already running in production, including yours.

Sources

1. Beyond Benchmarks: The Economics of AI Inference, arXiv

2. AI Inference Costs in 2025, TensorMesh

3. Larry Ellison Targets Multi-Trillion-Dollar AI Inference Market, Benzinga