Llama.cpp Just Merged MTP And You Should Be Using It.

MTP (Multi-Token Prediction) is not a new idea, but it is finally supported in the beloved llama.cpp engine! MTP is basically speculative decoding, but packaged into a single model: the model drafts several tokens ahead and then verifies them itself, so no separate draft model is needed. Depending on model and hardware you can get up to 2x faster TPS (tokens per second) with no downside!

Not *every* model supports MTP, and if you are using something like Qwen3.5 or Qwen3.6, you'll need to redownload your GGUF file with MTP support, since this was merged so recently. That being said, I was getting 25% faster TPS on my M4 Pro, and depending on hardware you can get a lot more. All of this comes without any accuracy tradeoffs: you just get more TPS on the exact same hardware with a simple llama.cpp config option! Pretty cool, and I am happy this finally got merged, since it's likely we will see a lot more MTP models in the future.

Links:
llama.cpp PR: https://github.com/ggml-org/llama.cpp...
Download llama.cpp: https://github.com/ggml-org/llama.cpp...
AnythingLLM: https://github.com/Mintplex-Labs/anyt...
Qwen 3.5 9B MTP GGUF example: https://huggingface.co/unsloth/Qwen3....

Chapters:
0:00 Local AI is improving fast
1:35 Intro to AnythingLLM
2:35 MTP (Multi Token Prediction) is merged!
3:18 What is MTP?
5:37 What models support MTP?
7:20 MTP support is still in progress!
7:53 Here is the annoying part...
9:53 How to run llama.cpp with MTP support locally!
11:28 Benchmarking, running and tuning MTP for local AI
15:25 MTP is a welcome addition to local AI for llama.cpp!
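
For anyone curious why speculative-style decoding can be faster without changing the output, here is a toy Python sketch of the draft-and-verify loop. It is not llama.cpp's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the main model and the cheap multi-token draft step, and the "tokens" are just small integers. The point it illustrates is that every token kept is the one the main model would have picked anyway, so the result matches plain greedy decoding while needing fewer main-model passes.

```python
def target_next(tokens):
    # Hypothetical "main model": a deterministic next-token rule.
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    # Hypothetical cheap draft step: usually agrees with the main model,
    # but is deliberately wrong whenever the context length is a multiple of 4.
    guess = target_next(tokens)
    return guess if len(tokens) % 4 != 0 else (guess + 1) % 50


def greedy_decode(prompt, n_new):
    # Baseline: one main-model pass per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, n_new, n_draft=3):
    tokens = list(prompt)
    goal = len(prompt) + n_new
    passes = 0  # counts verification passes of the main model
    while len(tokens) < goal:
        # 1. Draft a few tokens ahead with the cheap step.
        draft = []
        for _ in range(n_draft):
            draft.append(draft_next(tokens + draft))
        # 2. Verify the draft. A real engine checks all draft positions
        #    in one batched forward pass; here we loop for clarity.
        passes += 1
        for d in draft:
            correct = target_next(tokens)
            tokens.append(correct)  # always keep the main model's token
            if d != correct or len(tokens) >= goal:
                break  # first mismatch (or done): discard the rest of the draft
        else:
            # Every draft token matched, so the same pass also yields
            # one extra token from the main model.
            tokens.append(target_next(tokens))
    return tokens, passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(prompt, 12)
    fast, passes = speculative_decode(prompt, 12)
    print("same output:", fast == baseline)
    print("main-model passes:", passes, "vs", 12, "for plain decoding")
```

How much faster this is in practice depends on how often the drafted tokens are accepted, which is why the speedup varies by model and hardware.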