The Residual Connection Is Broken. Here's the Fix.

Attention Residuals replaces the standard fixed residual accumulation with depth-wise softmax attention over all preceding layer outputs. This enables each layer to selectively combine earlier representations using learned, input-dependent weights.

00:00 Intro to residual connections
03:27 Intuition behind attention residuals
04:43 Full attention residuals
09:43 Block attention residuals
13:07 Parallelism
14:21 Infrastructure design for efficient training
20:03 Infrastructure design for efficient inference
22:01 Discussions
21:02 Related work

References:
[Attention Residual] https://arxiv.org/abs/2603.15031
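
To make the idea in the description concrete, here is a minimal sketch in plain Python. It contrasts a fixed residual sum with a softmax-weighted mix over earlier layer outputs. The function names, the single per-layer `query` vector, and the scaled dot-product scoring are assumptions chosen for illustration; the paper's exact formulation may differ.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def standard_residual(prev_outputs):
    # Fixed accumulation: every earlier output contributes with weight 1.
    d = len(prev_outputs[0])
    return [sum(h[j] for h in prev_outputs) for j in range(d)]

def attention_residual(prev_outputs, query):
    # Hypothetical depth-wise attention: score each earlier layer output
    # against a learned query (assumed here), then mix with softmax weights.
    d = len(query)
    scores = [
        sum(q * x for q, x in zip(query, h)) / math.sqrt(d)
        for h in prev_outputs
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * h[j] for w, h in zip(weights, prev_outputs))
        for j in range(d)
    ]
    return mixed, weights

# Three earlier layer outputs for one token, each of dimension 2.
prev_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # illustrative values, not trained parameters

mixed, weights = attention_residual(prev_outputs, query)
print("weights:", weights)
print("mixed:", mixed)
print("fixed sum:", standard_residual(prev_outputs))
```

Because the scores depend on the layer outputs themselves, the weights change with the input, whereas the fixed sum treats every earlier layer identically.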