Longformer: The Long-Document Transformer

The Longformer extends the Transformer by introducing sliding window attention and sparse global attention. This allows for the processing of much longer documents than classic models like BERT.

Paper: https://arxiv.org/abs/2004.05150
Code: https://github.com/allenai/longformer

Abstract:
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.

Authors: Iz Beltagy, Matthew E. Peters, Arman Cohan

Links:
YouTube: / yannickilcher
Twitter: / ykilcher
BitChute: https://www.bitchute.com/channel/yann...
Minds: https://www.minds.com/ykilcher
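
To make the combination of local windowed attention and global attention concrete, here is a minimal NumPy sketch of the attention pattern described above. It is an illustration under stated assumptions, not the code from the allenai/longformer repository: the function names `longformer_style_mask` and `masked_attention`, the symmetric window, and the choice of global positions are all hypothetical. It also builds a dense mask for readability, so it does not itself achieve linear memory; it only shows that the number of allowed query-key pairs grows linearly with sequence length.

```python
import numpy as np

def longformer_style_mask(seq_len, window, global_idx=()):
    """Boolean mask: mask[i, j] is True when query i may attend to key j.

    Hypothetical helper for illustration; `window` is the total window
    size, split evenly on both sides of each token.
    """
    half = window // 2
    pos = np.arange(seq_len)
    # Local band: each token sees neighbors within `half` positions.
    mask = np.abs(pos[:, None] - pos[None, :]) <= half
    # Global tokens attend to every position and are attended to by all.
    g = np.asarray(list(global_idx), dtype=int)
    mask[g, :] = True
    mask[:, g] = True
    return mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the allowed pairs."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed pairs get -inf so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Example: 8 tokens, window of 4, with token 0 (e.g. a [CLS]-like token,
# an assumption here) marked as global.
mask = longformer_style_mask(8, 4, global_idx=[0])
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
out = masked_attention(q, k, v, mask)
```

Without global tokens, each row of the mask has at most `window + 1` allowed entries, so the total work a sparse implementation needs is proportional to `seq_len * window` rather than `seq_len ** 2`.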
