Do we need Attention? A Mamba Primer

A Technical Primer on Mamba and Friends. With Yair Schiff (https://yair-schiff.github.io/)

Slides: https://github.com/srush/mamba-primer...

Main focus:
Mamba: Linear-Time Sequence Modeling with Selective State Spaces http://arxiv.org/abs/2312.00752 from Albert Gu and Tri Dao.
Simplified State Space Layers for Sequence Modeling http://arxiv.org/abs/2208.04933 from Smith JT, Warrington A, Linderman SW

00:00 - Intro
04:03 - Section 1 - Linear Time Varying recurrences
12:07 - Section 2 - Associative Scan
16:27 - Section 3 - Continuous-Time SSMs
26:55 - Section 4 - Large States and Hardware-Aware Parameterizations
34:56 - Conclusion

Yang S, Wang B, Shen Y, Panda R, Kim Y. Gated Linear Attention Transformers with Hardware-Efficient Training. http://arxiv.org/abs/2312.06635

Arora S, Eyuboglu S, Zhang M, Timalsina A, Alberti S, Zinsley D, Zou J, Rudra A, Ré C. Simple linear attention language models balance the recall-throughput tradeoff. http://arxiv.org/abs/2402.18668

De S, Smith SL, Fernando A, Botev A, Cristian-Muraru G, Gu A, Haroun R, Berrada L, Chen Y, Srinivasan S, Desjardins G, Doucet A, Budden D, Teh YW, Pascanu R, De Freitas N, Gulcehre C. Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. http://arxiv.org/abs/2402.19427

Sun Y, Dong L, Huang S, Ma S, Xia Y, Xue J, Wang J, Wei F. Retentive Network: A Successor to Transformer for Large Language Models. http://arxiv.org/abs/2307.08621
