Profile: Stop Guessing vLLM Configs, get 15x tok/s + 93% less cost on Qwen3.6-27B w/ 8192 ctx size

In this demo, we run Profile, a physics-grounded, cost-aware optimization loop, on a struggling vLLM inference server to get the most out of our setup.

Setup: Qwen3.6-27B on a single NVIDIA A100-80GB (RAG workload) with an 8192-token context size.

Problem: Our vLLM server was struggling at a poor 31 tokens/sec. We were guessing at max-num-seqs limits and other configs to fix it.

The Fix & Results: Over a few iterations, Profile measured live traffic, calculated the physics ceiling of the A100, and told us what was bottlenecking our compute. We applied its config fixes, watched the improvement, and repeated the process until we hit the hardware limit.

❌ Untuned: 31 tok/s ($13.26 / 1M tokens)
✅ Profile Tuned: 470 tok/s ($0.89 / 1M tokens)
✅ 15x throughput. 93% cost reduction.

Final numbers:

GPU: EFFICIENCY 8.1% | POWER 390W | 0.83 J/tok | $0.89/1M tok (est) | vRAM 77/80GB (peak 79GB)
vLLM: REQUESTS run 100 (95.6%) | wait 149 | max 105
LATENCY ttft 52.9s (p95 129.2s) | tpot 199ms (p95 295ms)
CACHE kv_cache 81.5% avg | pfix_cache 61.6%
THROUGHPUT 470 tok/s
TRAFFIC qps 1.2 | req_total 74 | gen_total 31308 | preempt/s 0.00 | preempt_total 0

Once at the limit, Profile detected saturation. Instead of us wasting days tweaking configs, it showed the GPU was maxed out and advised a scale-out to preserve latency. No AI guesswork. Just physics ceilings and math.

Links:
Profile: https://jungledesh.github.io/profile/
Github: https://github.com/jungledesh/profile
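
For anyone who wants to sanity-check the headline numbers, here is a minimal sketch of the arithmetic behind them. It is not Profile's code. The hourly GPU price is not stated in the post, so `GPU_USD_PER_HOUR = 1.50` is an assumption picked because it roughly reproduces the quoted costs, and the function names are made up for illustration.

```python
# Back-of-envelope check of the figures quoted in the post.
# ASSUMPTION: the post does not state a GPU price. 1.50 USD/hour is a
# hypothetical A100 rental rate that roughly matches the quoted costs.
GPU_USD_PER_HOUR = 1.50


def usd_per_million_tokens(tok_per_s, usd_per_hour=GPU_USD_PER_HOUR):
    """Cost of generating 1M tokens at a steady throughput."""
    tokens_per_hour = tok_per_s * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


def joules_per_token(power_watts, tok_per_s):
    """Energy per token: watts are joules per second."""
    return power_watts / tok_per_s


# Values taken from the post.
untuned_tps = 31
tuned_tps = 470
power_watts = 390
untuned_cost = 13.26  # USD per 1M tokens, as quoted
tuned_cost = 0.89     # USD per 1M tokens, as quoted

speedup = tuned_tps / untuned_tps
cost_reduction = 1 - tuned_cost / untuned_cost

print(f"speedup: {speedup:.1f}x")
print(f"cost reduction: {cost_reduction:.1%}")
print(f"energy: {joules_per_token(power_watts, tuned_tps):.2f} J/tok")
print(f"tuned cost: ${usd_per_million_tokens(tuned_tps):.2f} / 1M tok")
print(f"untuned cost: ${usd_per_million_tokens(untuned_tps):.2f} / 1M tok")
```

The throughput ratio, the cost reduction, and the J/tok figure follow directly from the posted numbers. The per-token costs depend on the assumed hourly rate: at 1.50 USD/hour the tuned cost rounds to the quoted $0.89, while the untuned cost comes out slightly above the quoted $13.26, which would be consistent with the measured untuned throughput being a little over 31 tok/s.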