Model Parallelism: Tensor Parallelism

Tensor Parallelism (TP) peered inside the transformer and saw matrix multiplication everywhere: Y = X × W. Matrix multiplication has two beautiful properties:

    Column split: X × [W₁ | W₂] = [X×W₁ | X×W₂]

    Row split:    X × [W₁] = X₁×W₁ + X₂×W₂
                      [W₂]

    where X = [X₁ | X₂] is split by columns to match the row blocks of W.

Applied to a transformer MLP block (Linear1 → GELU → Linear2):

    GPU 0: X → Linear1_colA → GELU → Linear2_rowA ─┐
                                                   ├─ All-Reduce → Output
    GPU 1: X → Linear1_colB → GELU → Linear2_rowB ─┘

By pairing column-parallel with row-parallel, we only needed one All-Reduce per block in the forward pass. This works because GELU is elementwise: each GPU applies it to its own column shard of the Linear1 output, and that shard is exactly the input slice its row shard of Linear2 expects. For attention it was even cleaner: each GPU just owned a subset of heads.

But TP has a dark side: its All-Reduces are in the critical path. Each one must finish before the next layer can start, so you can't hide them behind computation. That's why TP only works well inside a single node with NVLink, typically TP ≤ 8.

A companion trick, Sequence Parallelism, splits the leftover operations (LayerNorm, Dropout) along the sequence dimension to keep activations sharded everywhere.
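The column-then-row pairing for the MLP block can be checked numerically. The following is a minimal sketch, not a real multi-GPU implementation: it simulates the GPUs with a loop, uses plain Python lists so it runs without dependencies, and stands in for All-Reduce with an elementwise sum. All function names (`mlp_tensor_parallel`, `all_reduce_sum`, and so on) and the matrix sizes are illustrative assumptions, and biases are omitted for brevity.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(m):
    """Exact (erf-based) GELU, applied elementwise."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]


def split_cols(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for All-Reduce: elementwise sum of the per-GPU partial outputs."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def max_abs_diff(a, b):
    """Largest absolute elementwise difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def mlp_reference(x, w1, w2):
    """Unsharded MLP: Linear1 -> GELU -> Linear2."""
    return matmul(gelu(matmul(x, w1)), w2)


def mlp_tensor_parallel(x, w1, w2, num_gpus):
    """Simulated TP: Linear1 split by columns, Linear2 split by rows."""
    w1_shards = split_cols(w1, num_gpus)  # column-parallel Linear1
    w2_shards = split_rows(w2, num_gpus)  # row-parallel Linear2
    partials = []
    for w1_k, w2_k in zip(w1_shards, w2_shards):
        # Each simulated GPU sees the full input X and works only on its shards.
        hidden_k = gelu(matmul(x, w1_k))         # no communication needed here
        partials.append(matmul(hidden_k, w2_k))  # partial sum of the output
    return all_reduce_sum(partials)              # the single All-Reduce


rng = random.Random(0)
x = random_matrix(3, 4, rng)   # 3 tokens, model width 4
w1 = random_matrix(4, 8, rng)  # Linear1: 4 -> 8
w2 = random_matrix(8, 4, rng)  # Linear2: 8 -> 4

err = max_abs_diff(mlp_reference(x, w1, w2),
                   mlp_tensor_parallel(x, w1, w2, num_gpus=2))
print(err < 1e-9)
```

The sharded and unsharded results agree up to floating-point rounding, and the only step that combines data across the simulated GPUs is the final sum. In a real setup that sum would be a collective over the TP group, which is the communication that sits in the critical path.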