Mixture of Experts: The AI Trick Eating the World's Memory

Mixture of Experts is the design behind most of today's biggest models. It lets a model hold a trillion parameters but only run a few billion per word, so the compute bill stays small. The cost lands somewhere else: every expert it isn't using still has to sit in memory, waiting to be picked. Across DeepSeek, Kimi, Qwen and Llama 4 that adds up to a structural appetite for RAM, and it's now spilling into the real world: Micron is sold out through 2026, prices have roughly doubled, and data centers are taking about 70% of the memory being made. This is the story of how a trick meant to save compute turned memory into the scarce resource.

Chapters
00:00 Why memory suddenly got expensive
00:29 Total vs. active parameters
01:10 How routing works
01:58 What an "expert" really is
02:44 Where MoE came from (1991 - 2017)
03:53 Mixtral makes it concrete
04:33 Why the simple router won
05:19 Why a Mac beats a 3090 for MoE
06:08 The model landscape, as a memory story
06:49 How DeepSeek stretched memory further
07:28 The memory wall
08:08 Speeding up the plumbing
08:50 Running experts from an SSD
09:46 The energy catch
10:26 Newer ways to page experts
11:00 The bet underneath it

Sources & further reading

The papers
Adaptive Mixtures of Local Experts (Jacobs, Jordan, Nowlan, Hinton, 1991): https://www.cs.toronto.edu/~hinton/ab...
Outrageously Large Neural Networks: the sparsely-gated MoE layer (Shazeer et al., 2017): https://arxiv.org/abs/1701.06538
Switch Transformer (Fedus, Zoph et al., 2021): https://arxiv.org/abs/2101.03961
Mixtral of Experts (Mistral, 2024): https://arxiv.org/abs/2401.04088
DeepSeekMoE: fine-grained and shared experts (2024): https://arxiv.org/abs/2401.06066
DeepSeek-V3 technical report (2024): https://arxiv.org/abs/2412.19437
LLM in a Flash: streaming weights from SSD (Apple, 2023): https://arxiv.org/abs/2312.11514
SSD Offloading for MoE Weights Considered Harmful in Energy Efficiency (2025): https://arxiv.org/abs/2508.06978
FlashMoE: ML-based expert caching (2026): https://arxiv.org/abs/2601.17063

Primers
Hugging Face - Mixture of Experts, explained: https://huggingface.co/blog/moe
Mistral - Mixtral announcement: https://mistral.ai/news/mixtral-of-ex...

The memory crunch
CNBC - AI memory is sold out; prices surging: https://www.cnbc.com/2026/01/10/micro...
IEEE Spectrum - how and when the DRAM shortage ends: https://spectrum.ieee.org/dram-shortage
SemiAnalysis - the memory wall and the HBM roadmap: https://newsletter.semianalysis.com/p...

Practitioners & systems
Jeremy Howard on MoE, 3090s and Macs: https://x.com/jeremyphoward/status/19...
Tri Dao on MoE kernels: https://x.com/tri_dao/status/20017852...
vLLM elastic expert parallelism: https://x.com/vllm_project/status/205...
Perplexity - trillion-parameter MoE across cloud nodes: https://research.perplexity.ai/articl...
Simon Willison - running Qwen 397B on a Mac with LLM-in-a-flash: https://simonwillison.net/2026/Mar/18...
llama.cpp - paging MoE experts from disk: https://github.com/ggml-org/llama.cpp...