SELF-DIRECTED P̶h̶D̶ EXD in AI Ep. 4: Inference Benchmarking continued

Welcome back to the EXD. Last week we took our first look at LLM inference using vLLM. More specifically, we learned about the prefill and decode phases of an inference pass and how their performance characteristics differ. This week we’ll dig a little deeper and make our first attempt at tuning vLLM for better performance.

My name is Ram, and I work at the Ethereum Foundation on internal AI ops. This is an open learning log for what I call the EXD.

Episode 01: SELF-DIRECTED P̶h̶D̶ EXD in AI Ep. 1: What...
EXD: github.com/Ramshreyas/EXD
Llama-benchy: https://github.com/eugr/llama-benchy