Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

#ai #technology #switchtransformer

Scale is the next frontier for AI. Google Brain uses sparsity and hard routing to massively increase a model's parameters, while keeping the FLOPs per forward pass constant. The Switch Transformer compares favorably to its dense counterparts in terms of speed and sample efficiency and breaks the next magic number: One Trillion Parameters.

OUTLINE:
0:00 - Intro & Overview
4:30 - Performance Gains from Scale
8:30 - Switch Transformer Architecture
17:00 - Model-, Data- and Expert-Parallelism
25:30 - Experimental Results
29:00 - Stabilizing Training
32:20 - Distillation into Dense Models
33:30 - Final Comments

Paper: https://arxiv.org/abs/2101.03961
Codebase T5: https://github.com/google-research/te...

Abstract:
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.

Authors: William Fedus, Barret Zoph, Noam Shazeer
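To make the "hard routing" idea from the description concrete, here is a minimal sketch of top-1 (switch) routing in plain Python. It is not the paper's or the T5 codebase's implementation: the names `switch_route` and `switch_layer`, the toy experts, and the tiny dimensions are all assumptions made for illustration, and the capacity limits and load-balancing loss discussed in the paper are left out. It only shows that each token is sent to a single expert, so the per-token compute stays the same no matter how many experts (and therefore parameters) exist.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(tokens, router_weights):
    """Pick exactly one expert per token (top-1 routing).

    tokens:         list of token vectors (lists of floats)
    router_weights: one weight row per expert (hypothetical router matrix)
    Returns a list of (expert_index, gate_probability) pairs.
    """
    assignments = []
    for tok in tokens:
        # One router logit per expert: dot product of expert row and token.
        logits = [sum(w * x for w, x in zip(row, tok)) for row in router_weights]
        probs = softmax(logits)
        expert = max(range(len(probs)), key=probs.__getitem__)
        assignments.append((expert, probs[expert]))
    return assignments


def switch_layer(tokens, router_weights, experts):
    """Run each token through its single chosen expert, scaled by the gate."""
    outputs = []
    for tok, (expert, gate) in zip(tokens, switch_route(tokens, router_weights)):
        y = experts[expert](tok)  # only one expert runs per token
        outputs.append([gate * v for v in y])
    return outputs


# Toy setup (assumed for illustration): 2-dim tokens, 2 experts that just scale.
router_weights = [[1.0, 0.0], [0.0, 1.0]]
experts = [lambda x: [1.0 * v for v in x], lambda x: [10.0 * v for v in x]]
tokens = [[1.0, 0.0], [0.0, 1.0]]

routes = switch_route(tokens, router_weights)
outputs = switch_layer(tokens, router_weights, experts)
```

With this toy router, the first token is routed to expert 0 and the second to expert 1. Adding more experts would add parameters, but each token would still run through exactly one of them.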
