Profile: Stop Guessing vLLM Configs, get 15x tok/s + 93% less cost on Qwen3.6-27B w/ 8192 ctx size

In this demo, we run Profile, a physics-grounded, cost-aware optimization loop, on a struggling vLLM inference server to get the most out of our setup.

Setup: Qwen3.6-27B on a single NVIDIA A100-80GB (RAG workload) with an 8192-token context size.

Problem: Our vLLM server was struggling at a poor 31 tokens/sec. We were guessing at max-num-seqs limits and other configs to fix it.

The Fix & Results: Over a few iterations, Profile measured live traffic, calculated the physics ceiling of the A100, and told us what was bottlenecking our compute. We applied its config fixes, watched the improvement, and repeated the process until we hit the hardware limit.

❌ Untuned: 31 tok/s ($13.26 / 1M tokens)
✅ Profile Tuned: 470 tok/s ($0.89 / 1M tokens)
✅ 15x throughput. 93% cost reduction.

Final numbers:

GPU:
EFFICIENCY 8.1% | POWER 390W | 0.83 J/tok | $0.89/1M tok (est) | vRAM 77/80GB (peak 79GB)

vLLM:
REQUESTS run 100 (95.6%) | wait 149 | max 105
LATENCY ttft 52.9s (p95 129.2s) | tpot 199ms (p95 295ms)
CACHE kv_cache 81.5% avg | pfix_cache 61.6%
THROUGHPUT 470 tok/s
TRAFFIC qps 1.2 | req_total 74 | gen_total 31308 | preempt/s 0.00 | preempt_total 0

Once at the limit, Profile detected saturation. Instead of letting us waste days tweaking configs, it proved the GPU was maxed out and advised a scale-out to preserve latency.

No AI guesswork. Just physics ceilings and math.

Links:
Profile: https://jungledesh.github.io/profile/
Github: https://github.com/jungledesh/profile
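The headline figures above can be sanity-checked with simple arithmetic. The sketch below is not Profile's code; the function names are hypothetical, and the GPU hourly rate is not stated in the description, so it is back-derived from the reported cost and throughput rather than assumed.

```python
# Hypothetical helpers for checking the reported numbers; not part of Profile.

def joules_per_token(power_watts: float, tokens_per_sec: float) -> float:
    """Energy per generated token: watts are joules per second."""
    return power_watts / tokens_per_sec


def implied_hourly_rate(usd_per_million_tokens: float, tokens_per_sec: float) -> float:
    """GPU price per hour implied by a cost-per-1M-tokens figure."""
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_million_tokens * tokens_per_hour / 1_000_000


def cost_per_million_tokens(gpu_usd_per_hour: float, tokens_per_sec: float) -> float:
    """Cost of 1M tokens at a given hourly GPU price and throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Figures taken from the description above.
untuned_tps, tuned_tps = 31, 470
untuned_cost, tuned_cost = 13.26, 0.89

speedup = tuned_tps / untuned_tps              # throughput ratio
cost_reduction = 1 - tuned_cost / untuned_cost  # fractional saving
energy = joules_per_token(390, tuned_tps)       # from POWER 390W
rate = implied_hourly_rate(tuned_cost, tuned_tps)

print(f"speedup: {speedup:.1f}x")
print(f"cost reduction: {cost_reduction:.1%}")
print(f"energy: {energy:.2f} J/tok")
print(f"implied GPU rate: ${rate:.2f}/hour")
```

Both the untuned and tuned cost figures imply a GPU price of roughly $1.50 per hour, so the cost reduction comes entirely from the throughput gain.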
