INFERENCE DIAGNOSTICS FOR vLLM

Are you getting what your hardware is capable of?

Profile computes the physics ceiling for your GPU and model, measures the live server against it, names the one cause holding it back, and gives you the flag to change. Then it measures whether the fix worked.

Fewer words. Less noise. More signal. More value.


15x throughput, A100 run

3.9x throughput, H100 run

93% lower $/1M tok, A100 run

1 binary, nothing leaves the machine

WHAT YOU GET

The state of the server, the one thing wrong with it, and the fix.

Every value is measured or marked. A dash means Profile could not read it. An (est) means the number came from the physics model, not the server.

profile diagnose --url http://localhost:8000/metrics --duration 2m

PROFILE v2.1.4 [Qwen3.6-27B] [NVIDIA H100 80GB HBM3]

GPU        decode_eff ~0.9% | power 653W | $5.09/1M output tok (est) | vRAM 74/80GB
REQUESTS   run 14 (4.0%) | wait 4 | max 345
LATENCY    ttft 8.7s (p95 19.2s) | tpot 97ms (p95 159ms)
CACHE      kv_cache 93.1% avg (100.0% peak) | pfix_cache -
THROUGHPUT 163 tok/s

ISSUES:

[!] KV Cache Pressure   seen in 92% of windows
    Cause:  KV cache 94% avg in fired windows, 100% peak (threshold: 88%)
            4 requests queued on KV admission
    Fix:    Lower --max-model-len (current: 262144). Observed avg 13.9k tok/request.
            Enable --enable-prefix-caching to share KV blocks across prefixes.
    Confidence: High
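A minimal sketch of the two ideas visible in the report above: a value that cannot be read stays unread (the dash), and an issue is scored by the share of sampling windows in which it fired. This is not Profile's implementation. The function names, the window representation, and the metric name in the example are assumptions; vLLM metric names vary by version, and the small parser below ignores edge cases such as `}` inside quoted label values.

```python
import re

# Prometheus text exposition: `name{labels} value`, labels optional.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')

def read_gauge(metrics_text, name):
    """Return the first sample value for `name`, or None if it is absent."""
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        m = _SAMPLE.match(line)
        if m and m.group(1) == name:
            return float(m.group(2))
    return None  # caller prints a dash instead of inventing a value

def fired_fraction(window_avgs, threshold=0.88):
    """Share of windows whose average KV cache usage exceeds the threshold."""
    if not window_avgs:
        return 0.0
    fired = [w for w in window_avgs if w > threshold]
    return len(fired) / len(window_avgs)

if __name__ == "__main__":
    sample = (
        "# TYPE vllm:num_requests_waiting gauge\n"
        'vllm:num_requests_waiting{model_name="m"} 4.0\n'
    )
    print(read_gauge(sample, "vllm:num_requests_waiting"))
    print(read_gauge(sample, "not_exported"))
    print(fired_fraction([0.95, 0.97, 0.80, 1.0]))
```

Returning `None` for a missing metric, rather than zero, is what lets a report distinguish "not measured" from "measured as nothing".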

THE ENGINE

One cause at a time.

Eight rules watch eight failure modes. A mutual exclusivity table removes symptoms another cause already explains, and a priority DAG keeps a tuning suggestion from ever outranking an active bottleneck. One primary is shown; the rest are held.
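One way to picture the selection step, as a hedged sketch rather than Profile's actual engine: fired rules are filtered through a suppression table, then ordered so that any bottleneck sorts ahead of any tuning suggestion. Every rule name except KV cache pressure is hypothetical, the table contents are invented for illustration, and the priority DAG is flattened here into a single list (one topological order of it).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    rule: str
    kind: str  # "bottleneck" or "tuning"

# Hypothetical exclusivity table: a fired key explains away the listed symptoms.
EXPLAINS = {
    "kv_cache_pressure": {"queue_buildup", "high_ttft"},
}

# Hypothetical priority order: earlier entries are more fundamental causes.
PRIORITY = ["kv_cache_pressure", "queue_buildup", "high_ttft", "batch_too_small"]

def select_primary(findings):
    """Return (primary, held): one finding to show, the rest kept back."""
    fired = {f.rule for f in findings}
    suppressed = set()
    for rule in fired:
        suppressed |= EXPLAINS.get(rule, set())
    remaining = [f for f in findings if f.rule not in suppressed]
    # Bottlenecks always sort before tuning suggestions, then by priority.
    remaining.sort(key=lambda f: (f.kind != "bottleneck", PRIORITY.index(f.rule)))
    if not remaining:
        return None, []
    return remaining[0], remaining[1:]

if __name__ == "__main__":
    primary, held = select_primary([
        Finding("batch_too_small", "tuning"),
        Finding("high_ttft", "bottleneck"),
        Finding("kv_cache_pressure", "bottleneck"),
    ])
    print(primary.rule, [f.rule for f in held])
```

In this example the latency symptom is dropped because KV pressure already explains it, and the tuning suggestion is held rather than shown.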

THE LOOP

Apply. Measure. Repeat.

Diagnose

Profile reads the live server under its own traffic. No restarts, no synthetic load, no agent.

Apply the fix

Everything in the block, one restart. Profile reconnects when vLLM returns.

New --max-num-seqs [current: 345]: 170

Measure the delta

Regressions are labelled, not buried. A tool that only reports improvements cannot be trusted when it reports one.

Throughput   610 → 545 tok/s  worse
TTFT         1024 → 6067ms  worse
TPOT         118.4 → 147.0ms  worse

ECONOMICS:
Cost/1M      $1.36 → $1.52 (est)  worse
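Labelling a delta depends on which direction is good for each metric: throughput should rise, while latency and cost should fall. A minimal sketch of that rule follows. The metric keys, the "better" and "flat" labels, and the 1% noise band are assumptions; the output above only shows "worse".

```python
# Direction of improvement per metric (keys are illustrative, not Profile's).
HIGHER_IS_BETTER = {
    "throughput": True,
    "ttft": False,
    "tpot": False,
    "cost_per_1m": False,
}

def label_delta(metric, before, after, noise=0.01):
    """Label a before/after pair as 'better', 'worse', or 'flat'."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    change = (after - before) / before
    if abs(change) <= noise:  # assumed noise band, not a documented value
        return "flat"
    improved = (change > 0) == HIGHER_IS_BETTER[metric]
    return "better" if improved else "worse"

if __name__ == "__main__":
    # The four rows from the delta block above.
    print(label_delta("throughput", 610, 545))
    print(label_delta("ttft", 1024, 6067))
    print(label_delta("tpot", 118.4, 147.0))
    print(label_delta("cost_per_1m", 1.36, 1.52))
```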

PROOF

Two runs. Same model, different hardware.

A100-SXM4-80GB · QWEN3.6-27B

Throughput   31 → 470 tok/s
Cost         $13.26 → $0.89 / 1M tok

15x throughput · 93% lower cost

H100 80GB HBM3 · QWEN3.6-27B

Throughput   163 → 631 tok/s
Cost         $5.09 → $1.32 / 1M output tok

3.9x throughput · 74% lower cost · 7 iterations, 2 labelled regressions
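The cost and throughput figures in both runs move together. If the hourly GPU price is held fixed, cost per million tokens is inversely proportional to throughput, so the percentage saving follows from the throughput ratio alone. The sketch below assumes exactly that; the formula behind Profile's (est) cost is not stated here, and no hourly price is given, so `hourly_usd` is a free parameter.

```python
def cost_per_million(hourly_usd, tok_per_s):
    """Dollars per 1M tokens at a fixed hourly price and steady throughput."""
    return hourly_usd / (tok_per_s * 3600) * 1_000_000

def cost_reduction(tok_before, tok_after):
    """Fractional cost saving implied by a throughput change at a fixed price."""
    return 1 - tok_before / tok_after

if __name__ == "__main__":
    print(f"A100: {cost_reduction(31, 470):.0%} lower")   # 31 -> 470 tok/s
    print(f"H100: {cost_reduction(163, 631):.0%} lower")  # 163 -> 631 tok/s
```

Under that assumption, 31 → 470 tok/s implies about 93% lower cost and 163 → 631 tok/s about 74%, in line with the figures quoted for the two runs.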

The H100 sequence was not monotonic: 163, 328, 545, 543, 482, 610, 545, 631. Profile labelled both regressions. Your starting point sets your gain; a server already near its ceiling has nothing to recover, and Profile says so.
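The H100 sequence can be checked step by step. The sketch below flags any step whose throughput falls by more than a noise band. The 1% band is an assumption, not a documented Profile setting: with it, the 545 → 543 step counts as flat and the two larger drops are flagged.

```python
def regressions(seq, noise=0.01):
    """Return (before, after) pairs where throughput fell by more than `noise`."""
    return [(a, b) for a, b in zip(seq, seq[1:]) if (b - a) / a < -noise]

def overall_gain(seq):
    """Ratio of the final measurement to the baseline."""
    return seq[-1] / seq[0]

if __name__ == "__main__":
    h100 = [163, 328, 545, 543, 482, 610, 545, 631]  # tok/s, from the run above
    print(regressions(h100))
    print(round(overall_gain(h100), 1))
```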

WHERE THIS GOES

Diagnose one node. Then run the fleet.

Today, Profile diagnoses a single node, and every number it prints is measured or marked as an estimate. That constraint is the product: a diagnostic tool has nothing but its credibility.

The end state is a control plane. Profile running as a daemon across a fleet: re-sharding a node paying the PCIe tensor-parallel tax, moving traffic off a node under KV pressure before latency spikes, holding every node at the SLA-throughput knee as traffic shifts through the day.

None of that is safe to build until the physics and the engine are right on one node. That is the work happening now, in the open.

INSTALL

One binary. No agent, no config, no calibration.

curl --proto '=https' --tlsv1.2 -LsSf \
  https://github.com/jungledesh/profile/releases/latest/download/profile-installer.sh | sh

profile diagnose --url http://localhost:8000/metrics --duration 2m

NVIDIA or AMD, one GPU. Requires vLLM with /metrics reachable and live traffic. Nothing leaves the machine. Binaries on the releases page.