I-JEPA: Importance of Predicting in Latent Space

I-JEPA is the first implementation of the Joint Embedding Predictive Architecture (JEPA) proposed by Yann LeCun. I am a huge fan of LeCun, and many of my own views on AI have been shaped by his.

However, I do not agree with using a Vision Transformer (ViT) as the encoder, as I believe it loses much of the spatial information in images. It also takes a long time to train, because it lacks the inductive biases that suit image data (such as the translation equivariance and locality built into convolutions). I-JEPA achieves impressive downstream performance, for example on ImageNet Top-1 classification, but it could perhaps do better if the masked objective were applied to a CNN-like architecture instead, possibly with self-attention layers over the convolutional filter outputs.

We could also explore Stable-Diffusion-like conditioning, where the predictor module is conditioned on a text input when predicting the latent space. Broad-to-specific conditioning, and using a memory of similar latent spaces, are also worth exploring. In the end, I believe a hierarchical architecture could be a better bet: it would go from broad to specific, with each layer of abstraction conditioned on the broader layer above it, and with attention across all the generated layers of abstraction (latent spaces) used for prediction.

That said, I-JEPA is a promising first step, and I am excited to see what comes next.
~~~~~~~~~~~~~~~~~~
Slides: https://github.com/tanchongmin/Tensor...
Reference Materials:
I-JEPA: https://arxiv.org/abs/2301.08243
Vision Transformers: https://arxiv.org/abs/2010.11929
Swin Transformers (Transformers with hierarchy and shifting attention windows): https://arxiv.org/abs/2103.14030
MLP-Mixer (All MLP only image processing): https://arxiv.org/abs/2105.01601
Conv-Mixer (Patches with Conv layers): https://arxiv.org/abs/2201.09792
Stable Diffusion: https://arxiv.org/abs/2112.10752
~~~~~~~~~~~~~~~~~~
0:00 Introduction
5:54 Transformers: Prediction back in input space
11:12 Prediction in Latent Space
22:25 Stable Diffusion and Latent Space
29:17 Vision Transformer (ViT)
44:57 Swin Transformer
50:12 ViT's positional encoding may not be good!
51:38 I-JEPA
1:09:26 Discussion on how to improve I-JEPA
~~~~~~~~~~~~~~~~~~
AI and ML enthusiast. I like to think about the essence behind AI breakthroughs and explain it in a simple and relatable way. I am also an avid game creator.
Discord: / discord
LinkedIn: / chong-min-tan-94652288
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: / johntanchongmin
Try out my games here: https://simmer.io/@chongmin