SELF-DIRECTED P̶h̶D̶ EXD in AI Ep. 5: Speculative Decoding
Welcome back to the EXD! Last week we took a deeper look at inference benchmarking with Llama-benchy; for example, we learned how aggregate token throughput can increase under concurrent load. This week we look at speculative decoding and the closely related Multi-Token Prediction (MTP). Speculative decoding is a clever way to make better use of your compute resources during the decode pass. Today we will simply show that it actually works, and in a future episode, when we introduce LLM architectures, we will see why it works.

My name is Ram, I work at the Ethereum Foundation on AI ops, and this is an open learning log that I call the EXD.

Episode 1: • SELF-DIRECTED P̶h̶D̶ EXD in AI Ep. 1: What...
Github: https://github.com/Ramshreyas/EXD
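As a rough illustration of the accept/reject loop behind speculative decoding, here is a minimal Python sketch. It assumes greedy decoding and uses two hypothetical toy "models" (`toy_target`, `toy_draft`) that map a token prefix to the next token; a real setup would use an LLM as the target and a small model or MTP heads as the draft, and would use rejection sampling instead of exact matching when sampling with temperature. The draft proposes k tokens, the target checks them, and every kept token is one the target would have produced anyway, so the output matches plain greedy decoding while needing fewer target passes.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    # Stand-in for the large, expensive model (toy arithmetic, not an LLM).
    return (prefix[-1] * 3 + len(prefix)) % 10


def toy_draft(prefix: List[int]) -> int:
    # Stand-in for the cheap drafter: agrees with the target except when
    # the prefix length is a multiple of 4, where it guesses wrong.
    t = toy_target(prefix)
    return t if len(prefix) % 4 else (t + 1) % 10


def greedy_decode(target: Model, prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """Baseline: one target pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq, n_new


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1. Draft model proposes up to k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(min(k, goal - len(seq))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. Target verifies the proposal. This toy calls it once per position;
        #    a real implementation scores all positions in one forward pass,
        #    which is why each verification round is counted as one pass.
        passes += 1
        accepted, ctx = [], list(seq)
        for tok in proposal:
            expected = target(ctx)
            if expected != tok:
                accepted.append(expected)  # first mismatch: keep the target's token, drop the rest
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Every draft token matched: the same pass also yields one "bonus" token.
            if len(seq) + len(accepted) < goal:
                accepted.append(target(ctx))
        seq.extend(accepted)
    return seq, passes


ref, ref_passes = greedy_decode(toy_target, [1], 20)
out, spec_passes = speculative_decode(toy_target, toy_draft, [1], 20, k=4)
print(out == ref, ref_passes, spec_passes)  # prints: True 20 5
```

The speedup in this toy comes entirely from how often the draft agrees with the target. With a draft that is always wrong, each pass yields exactly one token and the loop degrades to ordinary greedy decoding.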
