Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and Video Agents— Ethan He

From building NVIDIA’s Cosmos world model to joining xAI as Grok Imagine was being built from zero to one, Ethan He has been at the center of some of the most important work in video generation, multimodal models, and real-time world models. In this episode, Ethan joins swyx and Vibhu to unpack what it actually takes to build frontier image and video systems: data, VAEs, diffusion transformers, audio-video alignment, inference speedups, and the hidden cost of storing and moving massive video datasets. We go deep on Grok Imagine, how a small xAI team shipped its first multimodal video model in three months, why iteration speed matters more than almost anything in model development, and why many of the biggest gains come from fixing tiny bugs in data and training pipelines. Ethan also explains why video models may become the front end of AI, how generative UI could replace traditional interfaces, why world models need to be real-time, interactive, and long-horizon, and why the future of video generation may depend more on language models and agents than on diffusion alone. 
We discuss:
• Ethan’s path from NVIDIA Cosmos to xAI and Grok Imagine
• How xAI built its first image and video models from zero to one
• Why fast iteration, infra, and talent mattered more than meetings
• Why small data and training bugs can drive huge model quality gains
• Why coding models may make compute the bottleneck again
• How image and video models are trained with synthetic captions
• VAEs, tokenizers, latent space, and diffusion transformers
• Why image models are the foundation for video models
• Temporal compression, real-time video, and interactivity tradeoffs
• Flipbook, Neural OS, and the future of generative UI
• Why future interfaces may go directly from user intent to pixels
• The cost of training video models: storage, egress, and GPU hours
• Step distillation, consistency models, GANs, and fast inference
• Grok Imagine 0.9 and large-scale audio-video generation
• Why audio-video alignment is harder than text-video alignment
• Ethan’s definition of world models: real-time, interactive, long-horizon video
• Reference-to-video, video extension, and long-context video generation
• Why xAI’s research communication undersells the work behind Grok Imagine
• xAI culture, first-principles thinking, and working with Elon
• AI watermarking, SynthID, safety, and detecting generated media
• Prompt rewriting and why video models take instructions literally
• Grok Imagine Agent, video editing, and the rise of video agents
• Why language models may unlock the next wave of video generation
• Robotics, physical AI, and why embodiment may emerge from video-world models
• Why Ethan left xAI and why he is now focusing more on LLMs
• Self-managed context, memory, and the next frontier for language models

Ethan He
• LinkedIn: / ethanhe42
• X: https://x.com/EthanHe_42

Timestamps
00:00:00 Hook
00:01:16 Introduction
00:02:41 From NVIDIA Cosmos to xAI
00:04:40 Building Grok Imagine from Zero to One
00:11:23 How Image and Video Models Are Trained
00:20:09 Video Compression, VAEs, and Real-Time Tradeoffs
00:23:26 Generative UI, Flipbook, and Neural OS
00:33:26 The Cost of Training Large Video Models
00:38:20 Distillation, GANs, and Fast Video Inference
00:42:37 Audio-Video Generation and Grok Imagine 0.9
00:49:50 What Makes a World Model?
00:57:07 Reference Videos, Long Context, and Video Memory
01:01:27 xAI Culture, Research, and First-Principles Building
01:11:01 AI Safety, Watermarking, and Prompt Rewriting
01:14:26 Video Agents and AI-Assisted Creation
01:28:48 Why Language Models Unlock Better Video
01:32:31 Robotics, Physical AI, and Embodied World Models
01:33:54 Why Ethan Left xAI
01:35:32 Self-Managed Context and the Future of LLMs
01:39:59 Ethan’s Career Path and Closing Thoughts
