Parallelism in AI - Part 3: Tensor Parallelism
This is Part 3 of the Parallelism in AI series. In Part 1, we covered Data Parallelism & FSDP, and in Part 2, we explored Pipeline Parallelism. Now, we dive into Tensor Parallelism, which splits individual layers across devices.
What is Tensor Parallelism?
In the previous parts, we saw how Data Parallelism replicates the model and splits data, and how Pipeline Parallelism partitions the model by layers. Tensor Parallelism takes this a step further: instead of splitting at the layer level, it splits individual tensors (weight matrices) within a single layer across multiple devices. Each device computes a portion of a layer’s operation simultaneously, enabling us to parallelize even within a single transformer block. This is especially powerful for very large layers that are too memory-intensive for a single GPU.

Column Parallel Linear Layer
The first building block of tensor parallelism is the Column Parallel Linear layer. Given a weight matrix W, we split it along the column dimension into N partitions, one per device. Each device holds a slice W_i and computes Y_i = X · W_i independently using the full input X. Since the columns are independent, no communication is needed during the forward computation itself: each device produces a partial output corresponding to a subset of the output features. The partial outputs are then concatenated (or used directly by the next layer) to form the complete result.
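As a minimal sketch, we can simulate the column split with NumPy, treating array slices as "devices" (the variable names here are illustrative, not any framework's API):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, n_devices = 4, 8, 12, 3
X = rng.standard_normal((batch, d_in))
W = rng.standard_normal((d_in, d_out))

# Split W along the column (output-feature) dimension: one slice per "device".
W_shards = np.split(W, n_devices, axis=1)

# Each device computes Y_i = X @ W_i with the full input X -- no communication.
partials = [X @ W_i for W_i in W_shards]

# Concatenating the partial outputs recovers the full result.
Y = np.concatenate(partials, axis=1)
assert np.allclose(Y, X @ W)
```

Each shard's output covers a disjoint slice of output features, which is why concatenation (rather than a sum) reconstructs the full matmul.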

Row Parallel Linear Layer
The complement to column parallelism is the Row Parallel Linear layer. Here, the weight matrix is split along the row dimension. Each device holds a horizontal slice W_i and receives a corresponding partition of the input. Each device computes a partial matrix multiplication, producing a partial result. These partial results are then summed across devices using an AllReduce operation to produce the final output. Row parallel layers are typically paired with column parallel layers so that the output partitioning of one naturally feeds into the input partitioning of the other, minimizing communication.
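The row-parallel counterpart can be sketched the same way, again simulating devices with NumPy slices and standing in for AllReduce with an elementwise sum:

```python
import numpy as np

rng = np.random.default_rng(1)
batch, d_in, d_out, n_devices = 4, 12, 8, 3
X = rng.standard_normal((batch, d_in))
W = rng.standard_normal((d_in, d_out))

# Split W along the row (input-feature) dimension, and X along its
# feature dimension to match, one slice per "device".
W_shards = np.split(W, n_devices, axis=0)
X_shards = np.split(X, n_devices, axis=1)

# Each device computes a partial matmul over its feature slice.
partials = [X_i @ W_i for X_i, W_i in zip(X_shards, W_shards)]

# An AllReduce (here: an elementwise sum) combines the partial results.
Y = np.sum(partials, axis=0)
assert np.allclose(Y, X @ W)
```

Note that each partial has the full output shape but is wrong on its own; only the sum across devices equals the unsharded result, which is exactly why this layer needs an AllReduce where the column-parallel layer did not.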

Tensor Parallelism in the MLP Block
In a Transformer’s MLP (Feed-Forward) block, there are typically two linear layers with a non-linearity (e.g., GeLU) in between. Tensor parallelism applies column parallelism to the first linear layer and row parallelism to the second. The first layer splits its output features across devices, and the GeLU activation is applied locally on each device. The second layer then takes these partitioned activations as input and performs a row-parallel computation, finishing with an AllReduce to synchronize the output. This design requires only one AllReduce per MLP block in the forward pass, keeping communication overhead minimal.
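Putting the two together, the sharded MLP block can be sketched as follows (a NumPy simulation with illustrative shapes; GeLU uses the common tanh approximation):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(2)
batch, d_model, d_ff, n_devices = 4, 8, 16, 2
X = rng.standard_normal((batch, d_model))
W1 = rng.standard_normal((d_model, d_ff))   # first layer: column-parallel
W2 = rng.standard_normal((d_ff, d_model))   # second layer: row-parallel

W1_shards = np.split(W1, n_devices, axis=1)
W2_shards = np.split(W2, n_devices, axis=0)

# Each device: column-parallel matmul, local GeLU, row-parallel matmul.
# The GeLU is elementwise, so it can be applied to each shard independently.
partials = [gelu(X @ W1_i) @ W2_i for W1_i, W2_i in zip(W1_shards, W2_shards)]

# The single forward-pass AllReduce per MLP block (here: a sum).
Y = np.sum(partials, axis=0)

# Reference: the unsharded computation.
assert np.allclose(Y, gelu(X @ W1) @ W2)
```

The key design point is visible in the list comprehension: the column-parallel output feeds the row-parallel input directly on each device, so the activations between the two layers never cross device boundaries.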

Tensor Parallelism in the Attention Block
The multi-head attention mechanism is naturally suited for tensor parallelism because attention heads are independent computations. Each device is assigned a subset of the attention heads. The Query, Key, and Value projection matrices are split column-wise so that each device computes projections for its assigned heads. After computing attention independently, the output projection is applied as a row-parallel linear layer, with an AllReduce to combine the results. Just like the MLP block, this requires only one AllReduce per attention block in the forward pass.
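The head-sharded attention block can be sketched the same way (a NumPy simulation; shapes and names are illustrative, not Megatron-LM's actual API):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, K, V, d_head):
    return softmax(Q @ K.T / np.sqrt(d_head)) @ V

rng = np.random.default_rng(3)
seq, d_model, n_heads, n_devices = 5, 8, 4, 2
d_head = d_model // n_heads
d_local = (n_heads // n_devices) * d_head  # projection width per device

X = rng.standard_normal((seq, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))

partials = []
for dev in range(n_devices):
    cols = slice(dev * d_local, (dev + 1) * d_local)
    # Column-parallel Q/K/V projections: only this device's heads.
    Q, K, V = X @ Wq[:, cols], X @ Wk[:, cols], X @ Wv[:, cols]
    # Attention is computed per head, entirely locally.
    ctx = np.concatenate(
        [attend(Q[:, h*d_head:(h+1)*d_head],
                K[:, h*d_head:(h+1)*d_head],
                V[:, h*d_head:(h+1)*d_head], d_head)
         for h in range(n_heads // n_devices)], axis=1)
    # Row-parallel output projection: this device's row slice of Wo.
    partials.append(ctx @ Wo[cols, :])

# The single forward-pass AllReduce (here: a sum).
Y = np.sum(partials, axis=0)

# Reference: unsharded multi-head attention.
ctx_full = np.concatenate(
    [attend(X @ Wq[:, h*d_head:(h+1)*d_head],
            X @ Wk[:, h*d_head:(h+1)*d_head],
            X @ Wv[:, h*d_head:(h+1)*d_head], d_head)
     for h in range(n_heads)], axis=1)
assert np.allclose(Y, ctx_full @ Wo)
```

Because each head's computation touches only its own slice of the Q/K/V projections, no communication occurs until the output projection, mirroring the MLP block's column-then-row pattern.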

Communication: The AllReduce Cost
Tensor parallelism relies on AllReduce operations to synchronize partial results across devices. In each transformer layer, there are two AllReduce operations in the forward pass (one for the attention block and one for the MLP block) and two in the backward pass. Unlike pipeline parallelism, which only communicates between adjacent stages, tensor parallelism requires collective communication among all tensor-parallel devices within every layer. This makes tensor parallelism most effective when devices are connected via high-bandwidth interconnects (e.g., NVLink within a single node), as the communication latency directly impacts throughput.
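A back-of-the-envelope estimate makes the cost concrete (the batch/sequence/hidden sizes are hypothetical, and the 2·(p−1)/p factor is the standard per-device traffic of a ring AllReduce):

```python
# Hypothetical workload: bf16 activations on an 8-way tensor-parallel group.
batch, seq, hidden, tp = 8, 2048, 8192, 8
bytes_per_elem = 2  # fp16/bf16

# Each AllReduce operates on one activation tensor of this size.
tensor_bytes = batch * seq * hidden * bytes_per_elem

# 4 AllReduces per layer: 2 forward (attention + MLP) and 2 backward.
# Ring AllReduce moves 2*(p-1)/p of the tensor per device per operation.
per_device = 4 * 2 * (tp - 1) / tp * tensor_bytes
print(f"{per_device / 1e9:.1f} GB moved per device per layer")  # 1.9 GB
```

Multiplied across dozens of layers and every training step, this volume is why tensor parallelism is usually kept inside an NVLink domain.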

Tensor vs Pipeline vs Data Parallelism
Each parallelism strategy operates at a different granularity and serves a different purpose. In practice, state-of-the-art training systems like Megatron-LM combine all three in a 3D parallelism configuration: tensor parallelism within a node (leveraging fast NVLink), pipeline parallelism across nodes, and data parallelism across pipeline replicas.
| Feature | Data Parallelism | Pipeline Parallelism | Tensor Parallelism |
|---|---|---|---|
| What is split | Data (batches) | Model (layers) | Model (tensors within layers) |
| Granularity | Coarse | Medium | Fine |
| Model per device | Full copy (or sharded/FSDP) | Subset of layers | Subset of each layer |
| Communication | AllReduce (gradients) | Activations between stages | AllReduce (activations) |
| Best interconnect | Any | Moderate bandwidth | High bandwidth (NVLink) |
| Idle time | Minimal | Pipeline bubble | Minimal |
| Primary benefit | Faster training | Enables larger models | Parallelizes single layers |
| Typical scope | Across nodes | Across nodes | Within a node |
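The 3D configuration is ultimately a factorization of the GPU count into the three parallelism degrees, which can be sketched with hypothetical cluster numbers (8 NVLink-connected GPUs per node, 64 GPUs total):

```python
total_gpus = 64
tensor_parallel = 8    # within a node, matching the NVLink domain
pipeline_parallel = 4  # stages spread across nodes
data_parallel = total_gpus // (tensor_parallel * pipeline_parallel)

# The three degrees must multiply to the total device count.
assert tensor_parallel * pipeline_parallel * data_parallel == total_gpus
print(data_parallel)  # 2 pipeline replicas, each spanning 32 GPUs
```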

Summary
Tensor Parallelism splits individual weight matrices across devices, enabling parallelism at the finest granularity. Column parallel layers split output features, row parallel layers split input features, and the two are paired together within Transformer MLP and attention blocks to minimize communication. Each transformer block requires only two AllReduce operations in the forward pass. Because of its high communication requirements, tensor parallelism works best within a single node with fast interconnects. Combined with pipeline and data parallelism in a 3D parallelism setup, it is essential for training today’s largest language models.

References
- Shoeybi, M., et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053
- Narayanan, D., et al. (2021). Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. arXiv:2104.04473
- Korthikanti, V., et al. (2022). Reducing Activation Recomputation in Large Transformer Models. arXiv:2205.05198
- Manim Community Edition — Animation engine used for the visuals in this post
- Kokoro TTS — Text-to-speech model used to generate the audio narration