SELF-DIRECTED P̶h̶D̶ EXD in AI Ep. 5: Speculative Decoding

Welcome back to the EXD! Last week we took a deeper look at inference benchmarking with Llama-benchy. For example, we learned how aggregate token throughput can increase under concurrent load. This week we look at speculative decoding, of which Multi-Token Prediction (MTP) is one closely related variant. Speculative decoding is a rather clever way of making better use of your compute resources during the decode pass. Today we will just show that it actually works; in a future episode, when we introduce LLM architectures, we can understand why it works. My name is Ram, I work at the Ethereum Foundation on AI ops, and this is an open learning log that I call the EXD.

Episode 1: SELF-DIRECTED P̶h̶D̶ EXD in AI Ep. 1: What...
GitHub: https://github.com/Ramshreyas/EXD