Mixture of Experts: The AI Trick Eating the World's Memory

Mixture of Experts is the design behind most of today's biggest models. It lets a model hold a trillion parameters but only run a few billion per word, so the compute bill stays small. The cost lands somewhere else: every expert it isn't using still has to sit in memory, waiting to be picked. Across DeepSeek, Kimi, Qwen and Llama 4 that adds up to a structural appetite for RAM, and it's now spilling into the real world: Micron is sold out through 2026, prices have roughly doubled, and data centers are taking about 70% of the memory being made. This is the story of how a trick meant to save compute turned memory into the scarce resource.

Chapters
00:00 Why memory suddenly got expensive
00:29 Total vs. active parameters
01:10 How routing works
01:58 What an "expert" really is
02:44 Where MoE came from (1991 - 2017)
03:53 Mixtral makes it concrete
04:33 Why the simple router won
05:19 Why a Mac beats a 3090 for MoE
06:08 The model landscape, as a memory story
06:49 How DeepSeek stretched memory further
07:28 The memory wall
08:08 Speeding up the plumbing
08:50 Running experts from an SSD
09:46 The energy catch
10:26 Newer ways to page experts
11:00 The bet underneath it

Sources & further reading

The papers
Adaptive Mixtures of Local Experts (Jacobs, Jordan, Nowlan, Hinton, 1991): https://www.cs.toronto.edu/~hinton/ab...
Outrageously Large Neural Networks: the sparsely-gated MoE layer (Shazeer et al., 2017): https://arxiv.org/abs/1701.06538
Switch Transformer (Fedus, Zoph et al., 2021): https://arxiv.org/abs/2101.03961
Mixtral of Experts (Mistral, 2024): https://arxiv.org/abs/2401.04088
DeepSeekMoE: fine-grained and shared experts (2024): https://arxiv.org/abs/2401.06066
DeepSeek-V3 technical report (2024): https://arxiv.org/abs/2412.19437
LLM in a Flash: streaming weights from SSD (Apple, 2023): https://arxiv.org/abs/2312.11514
SSD Offloading for MoE Weights Considered Harmful in Energy Efficiency (2025): https://arxiv.org/abs/2508.06978
FlashMoE: ML-based expert caching (2026): https://arxiv.org/abs/2601.17063

Primers
Hugging Face - Mixture of Experts, explained: https://huggingface.co/blog/moe
Mistral - Mixtral announcement: https://mistral.ai/news/mixtral-of-ex...

The memory crunch
CNBC - AI memory is sold out; prices surging: https://www.cnbc.com/2026/01/10/micro...
IEEE Spectrum - how and when the DRAM shortage ends: https://spectrum.ieee.org/dram-shortage
SemiAnalysis - the memory wall and the HBM roadmap: https://newsletter.semianalysis.com/p...

Practitioners & systems
Jeremy Howard on MoE, 3090s and Macs: https://x.com/jeremyphoward/status/19...
Tri Dao on MoE kernels: https://x.com/tri_dao/status/20017852...
vLLM elastic expert parallelism: https://x.com/vllm_project/status/205...
Perplexity - trillion-parameter MoE across cloud nodes: https://research.perplexity.ai/articl...
Simon Willison - running Qwen 397B on a Mac with LLM-in-a-flash: https://simonwillison.net/2026/Mar/18...
llama.cpp - paging MoE experts from disk: https://github.com/ggml-org/llama.cpp...
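
Back-of-envelope: total vs. active parameters
The summary's central claim (compute follows the active parameters, memory follows the total) can be put into numbers with a few lines of Python. This is a rough sketch, not a figure from the video. The function name moe_footprint, the 32 billion active parameters (a stand-in for "a few billion"), and the one byte per parameter are all illustrative assumptions. The "2 FLOPs per active parameter per token" rule is a common forward-pass approximation that ignores attention over the context, the KV cache and activations.

```python
def moe_footprint(total_params, active_params, bytes_per_param=1.0):
    """Rough weight-memory and per-token compute estimate for an MoE model.

    Assumptions (illustrative, not taken from the video):
      - every parameter must be resident in memory, used or not
      - a forward pass costs about 2 FLOPs per *active* parameter per token
    """
    if not 0 < active_params <= total_params:
        raise ValueError("active_params must be in (0, total_params]")
    return {
        # Memory scales with the TOTAL parameter count.
        "weights_gb": total_params * bytes_per_param / 1e9,
        # Compute scales with the ACTIVE parameter count only.
        "gflops_per_token": 2 * active_params / 1e9,
        # Share of the weights actually touched for one token.
        "active_fraction": active_params / total_params,
    }


if __name__ == "__main__":
    # Hypothetical model: one trillion total parameters, 32 billion active,
    # stored at one byte per parameter (8-bit weights).
    estimate = moe_footprint(1_000_000_000_000, 32_000_000_000)
    for name, value in estimate.items():
        print(f"{name}: {value:g}")
```

Under these assumptions the per-token compute looks like that of a mid-sized dense model, while the weights alone need on the order of a terabyte of memory. That gap is the one the description points to.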
