I-JEPA: Importance of Predicting in Latent Space
I-JEPA is the first implementation of the Joint Embedding Predictive Architecture (JEPA) proposed by Yann LeCun. I am a huge fan of LeCun, and many of my thoughts on AI have been shaped by his views. However, I do not agree with using a Vision Transformer (ViT) as the encoder, as I think it loses much of the spatial information in images. Furthermore, it takes a long time to train because it lacks the inductive biases that suit image learning, such as translation equivariance.

While I-JEPA achieves impressive downstream performance, for example on ImageNet top-1 classification, it could perhaps do better if the masked objective were applied to a CNN-like architecture instead, perhaps with self-attention layers over the post-filter outputs. We could also explore Stable-Diffusion-like conditioning, whereby the predictor module is conditioned on some text input to predict the latent space. Broad-level to specific-level conditioning, and using memory of similar latent spaces, are also worth exploring.

In the end, I believe a hierarchical architecture could be a better bet: one that goes from broad to specific, with each layer of abstraction conditioned on the broader layer above it, and with attention across all the generated layers of abstraction (or latent spaces) used for prediction. That said, I-JEPA is a promising first step, and I am excited to see what comes next.

~~~~~~~~~~~~~~~~~~

Slides: https://github.com/tanchongmin/Tensor...
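
To make the idea of predicting in latent space concrete, here is a minimal sketch of the masked objective described above. It is not the I-JEPA implementation: the per-patch linear maps stand in for the ViT encoders, mean pooling plus a linear layer stands in for the transformer predictor, no gradient step is taken, and every name (`patchify`, `latent_loss`, `ema_update`, and so on) is hypothetical. The point is only that the loss compares predicted and target embeddings rather than pixels.

```python
import random

def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patches, row-major."""
    h, w = len(image), len(image[0])
    return [[image[top + i][left + j] for i in range(patch) for j in range(patch)]
            for top in range(0, h, patch) for left in range(0, w, patch)]

def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def ema_update(target_w, context_w, momentum):
    """Move target-encoder weights toward the context encoder (no gradients)."""
    return [[momentum * t + (1.0 - momentum) * c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(target_w, context_w)]

def latent_loss(image, context_w, target_w, predictor_w, pos_emb, target_ids, patch):
    """Mean squared error between predicted and target patch embeddings."""
    patches = patchify(image, patch)
    context_ids = [i for i in range(len(patches)) if i not in target_ids]
    # The context encoder only sees the visible (unmasked) patches.
    ctx = [linear(context_w, patches[i]) for i in context_ids]
    dim = len(ctx[0])
    pooled = [sum(v[d] for v in ctx) / len(ctx) for d in range(dim)]
    total = 0.0
    for i in target_ids:
        # Assumption: a per-patch linear map replaces the full-image ViT target encoder.
        target = linear(target_w, patches[i])
        # The predictor is told which position to predict via a positional embedding.
        query = [p + e for p, e in zip(pooled, pos_emb[i])]
        predicted = linear(predictor_w, query)
        # The error is measured in embedding space, never in pixel space.
        total += sum((a - b) ** 2 for a, b in zip(predicted, target)) / dim
    return total / len(target_ids)

rng = random.Random(0)
image = [[rng.random() for _ in range(4)] for _ in range(4)]
patch, dim = 2, 3
context_w = random_matrix(dim, patch * patch, rng)
target_w = [row[:] for row in context_w]  # target encoder starts as a copy
predictor_w = random_matrix(dim, dim, rng)
pos_emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(4)]

loss = latent_loss(image, context_w, target_w, predictor_w, pos_emb,
                   target_ids=[3], patch=patch)
target_w = ema_update(target_w, context_w, momentum=0.99)
print(f"latent-space loss: {loss:.4f}")
```

In a real training loop the context encoder and predictor would be updated by gradient descent on this loss, while the target encoder only follows through the moving average. Swapping the linear maps for a CNN-like encoder, as suggested above, would leave the objective itself unchanged.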
Reference Materials:
I-JEPA: https://arxiv.org/abs/2301.08243
Vision Transformers: https://arxiv.org/abs/2010.11929
Swin Transformers (Transformers with hierarchy and shifting attention windows): https://arxiv.org/abs/2103.14030
MLP-Mixer (MLP-only image processing): https://arxiv.org/abs/2105.01601
ConvMixer (Patches with Conv layers): https://arxiv.org/abs/2201.09792
Stable Diffusion: https://arxiv.org/abs/2112.10752

~~~~~~~~~~~~~~~~~~

0:00 Introduction
5:54 Transformers: Prediction back in input space
11:12 Prediction in Latent Space
22:25 Stable Diffusion and Latent Space
29:17 Vision Transformer (ViT)
44:57 Swin Transformer
50:12 ViT’s positional encoding may not be good!
51:38 I-JEPA
1:09:26 Discussion on how to improve I-JEPA

~~~~~~~~~~~~~~~~~~

AI and ML enthusiast. Likes to think about the essence behind AI breakthroughs and explain it in a simple and relatable way. Also, I am an avid game creator.

Discord: / discord
LinkedIn: / chong-min-tan-94652288
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: / johntanchongmin
Try out my games here: https://simmer.io/@chongmin