DINO: Emerging Properties in Self-Supervised Vision Transformers (Facebook AI Research Explained)

#dino #facebook #selfsupervised

Self-Supervised Learning is the final frontier in Representation Learning: Getting useful features without any labels. Facebook AI's new system, DINO, combines advances in Self-Supervised Learning for Computer Vision with the new Vision Transformer (ViT) architecture and achieves impressive results without any labels. Attention maps can be directly interpreted as segmentation maps, and the obtained representations can be used for image retrieval and zero-shot k-nearest neighbor classifiers (KNNs).

OUTLINE:
0:00 - Intro & Overview
6:20 - Vision Transformers
9:20 - Self-Supervised Learning for Images
13:30 - Self-Distillation
15:20 - Building the teacher from the student by moving average
16:45 - DINO Pseudocode
23:10 - Why Cross-Entropy Loss?
28:20 - Experimental Results
33:40 - My Hypothesis why this works
38:45 - Conclusion & Comments

Paper: https://arxiv.org/abs/2104.14294
Blog: / dino-paws-computer-vision-with-self-superv...
Code: https://github.com/facebookresearch/dino

My Video on ViT: • An Image is Worth 16x16 Words: Transformer...
My Video on BYOL: • BYOL: Bootstrap Your Own Latent: A New App...

Abstract:
In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.

Authors: Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: https://www.bitchute.com/channel/yann...
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: / yannic-kilcher-488534136
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
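For readers who want a concrete picture of the self-distillation idea from the abstract and outline (a teacher built from the student by moving average, trained with a cross-entropy loss), here is a rough NumPy sketch. It is not the code from the official repository. The function names, the temperatures, and the momentum values are illustrative assumptions, and the networks themselves are left out: the sketch only works on the output logits and on plain lists of parameter arrays.

```python
import numpy as np


def log_softmax(logits, temperature):
    """Numerically stable log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student output distributions.

    The teacher output is centered (subtract a running mean) and sharpened
    (low temperature). Temperatures here are assumed, illustrative values.
    No gradient flows through the teacher in the real method; NumPy has no
    autograd, so that is implicit in this sketch.
    """
    teacher_probs = np.exp(log_softmax(teacher_logits - center, teacher_temp))
    student_log_probs = log_softmax(student_logits, student_temp)
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())


def ema_update(teacher_params, student_params, momentum=0.996):
    """Build the teacher from the student by exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def update_center(center, teacher_logits, momentum=0.9):
    """Running mean of teacher outputs, used to center the teacher."""
    return momentum * center + (1.0 - momentum) * teacher_logits.mean(axis=0)


# Toy usage: random logits stand in for network outputs on two views.
rng = np.random.default_rng(0)
student_out = rng.normal(size=(8, 16))   # batch of 8, 16 output dimensions
teacher_out = rng.normal(size=(8, 16))
center = np.zeros(16)

loss = distillation_loss(student_out, teacher_out, center)
center = update_center(center, teacher_out)
teacher_weights = ema_update([np.ones(3)], [np.zeros(3)])
```

In a real training loop the student would be updated by gradient descent on this loss, and the teacher would only ever change through `ema_update`.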
