Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and Video Agents, with Ethan He

From building NVIDIA’s Cosmos world model to joining xAI as Grok Imagine was being built from zero to one, Ethan He has been at the center of some of the most important work in video generation, multimodal models, and real-time world models. In this episode, Ethan joins swyx and Vibhu to unpack what it actually takes to build frontier image and video systems: data, VAEs, diffusion transformers, audio-video alignment, inference speedups, and the hidden cost of storing and moving massive video datasets. We go deep on Grok Imagine, how a small xAI team shipped its first multimodal video model in three months, why iteration speed matters more than almost anything in model development, and why many of the biggest gains come from fixing tiny bugs in data and training pipelines. Ethan also explains why video models may become the front end of AI, how generative UI could replace traditional interfaces, why world models need to be real-time, interactive, and long-horizon, and why the future of video generation may depend more on language models and agents than on diffusion alone. 

We discuss:
• Ethan’s path from NVIDIA Cosmos to xAI and Grok Imagine
• How xAI built its first image and video models from zero to one
• Why fast iteration, infra, and talent mattered more than meetings
• Why small data and training bugs can drive huge model quality gains
• Why coding models may make compute the bottleneck again
• How image and video models are trained with synthetic captions
• VAEs, tokenizers, latent space, and diffusion transformers
• Why image models are the foundation for video models
• Temporal compression, real-time video, and interactivity tradeoffs
• Flipbook, Neural OS, and the future of generative UI
• Why future interfaces may go directly from user intent to pixels
• The cost of training video models: storage, egress, and GPU hours
• Step distillation, consistency models, GANs, and fast inference
• Grok Imagine 0.9 and large-scale audio-video generation
• Why audio-video alignment is harder than text-video alignment
• Ethan’s definition of world models: real-time, interactive, long-horizon video
• Reference-to-video, video extension, and long-context video generation
• Why xAI’s research communication undersells the work behind Grok Imagine
• xAI culture, first-principles thinking, and working with Elon
• AI watermarking, SynthID, safety, and detecting generated media
• Prompt rewriting and why video models take instructions literally
• Grok Imagine Agent, video editing, and the rise of video agents
• Why language models may unlock the next wave of video generation
• Robotics, physical AI, and why embodiment may emerge from video-world models
• Why Ethan left xAI and why he is now focusing more on LLMs
• Self-managed context, memory, and the next frontier for language models

Ethan He
• LinkedIn: / ethanhe42
• X: https://x.com/EthanHe_42

Timestamps
00:00:00 Hook
00:01:16 Introduction
00:02:41 From NVIDIA Cosmos to xAI
00:04:40 Building Grok Imagine from Zero to One
00:11:23 How Image and Video Models Are Trained
00:20:09 Video Compression, VAEs, and Real-Time Tradeoffs
00:23:26 Generative UI, Flipbook, and Neural OS
00:33:26 The Cost of Training Large Video Models
00:38:20 Distillation, GANs, and Fast Video Inference
00:42:37 Audio-Video Generation and Grok Imagine 0.9
00:49:50 What Makes a World Model?
00:57:07 Reference Videos, Long Context, and Video Memory
01:01:27 xAI Culture, Research, and First-Principles Building
01:11:01 AI Safety, Watermarking, and Prompt Rewriting
01:14:26 Video Agents and AI-Assisted Creation
01:28:48 Why Language Models Unlock Better Video
01:32:31 Robotics, Physical AI, and Embodied World Models
01:33:54 Why Ethan Left xAI
01:35:32 Self-Managed Context and the Future of LLMs
01:39:59 Ethan’s Career Path and Closing Thoughts