Parallelism in AI - Part 3: Tensor Parallelism

Author

Rohan Rajput

Published

March 11, 2026

This is Part 3 of the Parallelism in AI series. In Part 1, we covered Data Parallelism & FSDP, and in Part 2, we explored Pipeline Parallelism. Now, we dive into Tensor Parallelism, which splits individual layers across devices.

What is Tensor Parallelism?

In the previous parts, we saw how Data Parallelism replicates the model and splits data, and how Pipeline Parallelism partitions the model by layers. Tensor Parallelism takes this a step further: instead of splitting at the layer level, it splits individual tensors (weight matrices) within a single layer across multiple devices. Each device computes a portion of a layer’s operation simultaneously, enabling us to parallelize even within a single transformer block. This is especially powerful for very large layers that are too memory-intensive for a single GPU.

Column Parallel Linear Layer

The first building block of tensor parallelism is the Column Parallel Linear layer. Given a weight matrix W, we split it along the column dimension into N partitions, one per device. Each device holds a slice W_i and computes Y_i = X · W_i independently using the full input X. Since the columns are independent, no communication is needed during the forward computation itself: each device produces a partial output that corresponds to a subset of the output features. The partial outputs are then concatenated (or used directly by the next layer) to form the complete result.
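A minimal NumPy sketch makes the column-split concrete. The two "devices" are simulated by slicing the weight matrix; the shapes and the two-way split are illustrative choices, not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # input: batch of 4, 8 features
W = rng.standard_normal((8, 6))   # full weight matrix

# Split W along the column dimension across 2 simulated devices.
W_parts = np.split(W, 2, axis=1)  # each slice is (8, 3)

# Each device computes its partial output using the full input X.
Y_parts = [X @ W_i for W_i in W_parts]

# Concatenating the partial outputs recovers the full result,
# with no cross-device communication in the forward computation.
Y = np.concatenate(Y_parts, axis=1)
assert np.allclose(Y, X @ W)
```

Each `Y_parts[i]` holds a disjoint subset of the output features, which is exactly what the next (row-parallel) layer will consume.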

Row Parallel Linear Layer

The complement to column parallelism is the Row Parallel Linear layer. Here, the weight matrix is split along the row dimension. Each device holds a horizontal slice W_i and receives a corresponding partition of the input. Each device computes a partial matrix multiplication, producing a partial result. These partial results are then summed across devices using an AllReduce operation to produce the final output. Row parallel layers are typically paired with column parallel layers so that the output partitioning of one naturally feeds into the input partitioning of the other, minimizing communication.

Tensor Parallelism in the MLP Block

In a Transformer’s MLP (Feed-Forward) block, there are typically two linear layers with a non-linearity (e.g., GeLU) in between. Tensor parallelism applies column parallelism to the first linear layer and row parallelism to the second. The first layer splits its output features across devices, and the GeLU activation is applied locally on each device. The second layer then takes these partitioned activations as input and performs a row-parallel computation, finishing with an AllReduce to synchronize the output. This design requires only one AllReduce per MLP block in the forward pass, keeping communication overhead minimal.
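Putting the two pieces together, a sketch of the tensor-parallel MLP block under the same simulated-device assumptions (the tanh approximation of GeLU is used here for brevity):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(2)
X  = rng.standard_normal((4, 8))
W1 = rng.standard_normal((8, 32))  # expansion layer
W2 = rng.standard_normal((32, 8))  # projection layer

# Reference: the MLP computed sequentially on one device.
Y_ref = gelu(X @ W1) @ W2

# Tensor parallel over 2 simulated devices:
W1_parts = np.split(W1, 2, axis=1)  # first layer: column parallel
W2_parts = np.split(W2, 2, axis=0)  # second layer: row parallel

# Each device: column-parallel matmul, local GeLU on its slice of
# the hidden features, then a row-parallel matmul.
partials = [gelu(X @ w1) @ w2 for w1, w2 in zip(W1_parts, W2_parts)]

# A single AllReduce (sum) finishes the block.
Y = sum(partials)
assert np.allclose(Y, Y_ref)
```

Applying GeLU locally is valid because an elementwise activation commutes with the column partitioning: each device's hidden features are a disjoint slice, so no synchronization is needed between the two matmuls.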

Tensor Parallelism in the Attention Block

The multi-head attention mechanism is naturally suited for tensor parallelism because attention heads are independent computations. Each device is assigned a subset of the attention heads. The Query, Key, and Value projection matrices are split column-wise so that each device computes projections for its assigned heads. After computing attention independently, the output projection is applied as a row-parallel linear layer, with an AllReduce to combine the results. Just like the MLP block, this requires only one AllReduce per attention block in the forward pass.
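The head-sharding scheme can be sketched similarly. Two simulated devices each own two of four heads; the Q/K/V projections are sliced column-wise per head and the output projection is row parallel. All dimensions and the device-to-head assignment are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
seq, d_model, n_heads = 5, 8, 4
d_head = d_model // n_heads
X = rng.standard_normal((seq, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
Wo = rng.standard_normal((d_model, d_model))

def head_output(h):
    # Column slice of Wq/Wk/Wv belonging to head h (column parallel).
    s = slice(h * d_head, (h + 1) * d_head)
    Q, K, V = X @ Wq[:, s], X @ Wk[:, s], X @ Wv[:, s]
    return softmax(Q @ K.T / np.sqrt(d_head)) @ V  # (seq, d_head)

# Reference: all heads on one device, concatenated, then projected.
Y_ref = np.concatenate([head_output(h) for h in range(n_heads)], axis=1) @ Wo

# Tensor parallel over 2 simulated devices, 2 heads each;
# the output projection Wo is applied as a row-parallel layer.
partials = []
for dev, heads in enumerate([(0, 1), (2, 3)]):
    local = np.concatenate([head_output(h) for h in heads], axis=1)
    rows = slice(dev * 2 * d_head, (dev + 1) * 2 * d_head)
    partials.append(local @ Wo[rows, :])  # row-parallel partial result

Y = sum(partials)  # the single AllReduce for the attention block
assert np.allclose(Y, Y_ref)
```

Because each device's concatenated head outputs line up exactly with its row slice of Wo, the per-device partial products sum to the full output, mirroring the MLP case.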

Communication: The AllReduce Cost

Tensor parallelism relies on AllReduce operations to synchronize partial results across devices. In each transformer layer, there are two AllReduce operations in the forward pass (one for the attention block and one for the MLP block) and two in the backward pass. Unlike pipeline parallelism, which only communicates between adjacent stages, tensor parallelism requires collective communication among every device in the group within each layer. This makes tensor parallelism most effective when devices are connected via high-bandwidth interconnects (e.g., NVLink within a single node), as the communication latency directly impacts throughput.
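A back-of-the-envelope calculation shows why the interconnect matters. The batch, sequence, and hidden sizes below are illustrative assumptions, as is the ring-AllReduce cost model (each GPU sends roughly 2·(N−1)/N bytes per byte reduced):

```python
# Rough forward-pass AllReduce volume per transformer layer,
# assuming fp16 activations and a ring AllReduce.
batch, seq, hidden = 8, 2048, 4096
tp_size = 8                   # tensor-parallel group size
bytes_per_elem = 2            # fp16

msg = batch * seq * hidden * bytes_per_elem     # one activation tensor
per_layer = 2 * msg           # two AllReduces: attention + MLP
wire = per_layer * 2 * (tp_size - 1) / tp_size  # bytes each GPU sends

print(f"activation tensor: {msg / 2**20:.0f} MiB")
print(f"per-layer forward AllReduce traffic per GPU: {wire / 2**20:.0f} MiB")
```

With dozens of layers, this traffic repeats every forward and backward pass, which is why tensor-parallel groups are usually kept inside a single NVLink-connected node.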

Tensor vs Pipeline vs Data Parallelism

Each parallelism strategy operates at a different granularity and serves a different purpose. In practice, state-of-the-art training systems like Megatron-LM combine all three in a 3D parallelism configuration: tensor parallelism within a node (leveraging fast NVLink), pipeline parallelism across nodes, and data parallelism across pipeline replicas.

| Feature | Data Parallelism | Pipeline Parallelism | Tensor Parallelism |
|---|---|---|---|
| What is split | Data (batches) | Model (layers) | Model (tensors within layers) |
| Granularity | Coarse | Medium | Fine |
| Model per device | Full copy (or sharded/FSDP) | Subset of layers | Subset of each layer |
| Communication | AllReduce (gradients) | Activations between stages | AllReduce (activations) |
| Best interconnect | Any | Moderate bandwidth | High bandwidth (NVLink) |
| Idle time | Minimal | Pipeline bubble | Minimal |
| Primary benefit | Faster training | Enables larger models | Parallelizes single layers |
| Typical scope | Across nodes | Across nodes | Within a node |

Summary

Tensor Parallelism splits individual weight matrices across devices, enabling parallelism at the finest granularity. Column parallel layers split output features, row parallel layers split input features, and the two are paired together within Transformer MLP and attention blocks to minimize communication. Each transformer block requires only two AllReduce operations in the forward pass. Because of its high communication requirements, tensor parallelism works best within a single node with fast interconnects. Combined with pipeline and data parallelism in a 3D parallelism setup, it is essential for training today’s largest language models.
