<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>AI With Rohan</title>
<link>https://rohanrajput04.github.io/posts.html</link>
<atom:link href="https://rohanrajput04.github.io/posts.xml" rel="self" type="application/rss+xml"/>
<description>Deep dives into AI, ML, and distributed systems</description>
<generator>quarto-1.8.27</generator>
<lastBuildDate>Wed, 11 Mar 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Parallelism in AI - Part 3: Tensor Parallelism</title>
  <dc:creator>Rohan Rajput</dc:creator>
  <link>https://rohanrajput04.github.io/posts/tensor-parallelism/</link>
  <description><![CDATA[ 





<p><em>This is Part 3 of the <strong>Parallelism in AI</strong> series. In <a href="../../posts/data-parallelism/index.html">Part 1</a>, we covered Data Parallelism &amp; FSDP, and in <a href="../../posts/pipeline-parallelism/index.html">Part 2</a>, we explored Pipeline Parallelism. Now we dive into Tensor Parallelism, which splits individual layers across devices.</em></p>
<section id="what-is-tensor-parallelism" class="level1">
<h1>What is Tensor Parallelism?</h1>
<audio controls="" class="section-audio">
<source src="audio/what_is_tensor_parallelism.wav" type="audio/wav">
</audio>
<p>In the previous parts, we saw how Data Parallelism replicates the model and splits data, and how Pipeline Parallelism partitions the model by layers. Tensor Parallelism takes this a step further: instead of splitting at the layer level, it splits <strong>individual tensors</strong> (weight matrices) within a single layer across multiple devices. Each device computes a portion of a layer’s operation simultaneously, enabling us to parallelize even within a single transformer block. This is especially powerful for very large layers that are too memory-intensive for a single GPU.</p>
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/TensorParallelismIntro_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="column-parallel-linear-layer" class="level1">
<h1>Column Parallel Linear Layer</h1>
<audio controls="" class="section-audio">
<source src="audio/column_parallel_linear.wav" type="audio/wav">
</audio>
<p>The first building block of tensor parallelism is the <strong>Column Parallel Linear</strong> layer. Given a weight matrix <strong>W</strong>, we split it along the column dimension into N partitions, one per device. Each device holds a slice <strong>W_i</strong> and computes <strong>Y_i = X · W_i</strong> independently using the full input <strong>X</strong>. Since the columns are independent, no communication is needed during the forward computation itself: each device produces a partial output corresponding to a subset of the output features. The partial outputs are then concatenated (or consumed directly by the next layer) to form the complete result.</p>
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/ColumnParallelLinear_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="row-parallel-linear-layer" class="level1">
<h1>Row Parallel Linear Layer</h1>
<audio controls="" class="section-audio">
<source src="audio/row_parallel_linear.wav" type="audio/wav">
</audio>
<p>The complement to column parallelism is the <strong>Row Parallel Linear</strong> layer. Here, the weight matrix is split along the row dimension. Each device holds a horizontal slice <strong>W_i</strong> and receives a corresponding partition of the input. Each device computes a partial matrix multiplication, producing a partial result. These partial results are then summed across devices using an <strong>AllReduce</strong> operation to produce the final output. Row parallel layers are typically paired with column parallel layers so that the output partitioning of one naturally feeds into the input partitioning of the other, minimizing communication.</p>
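<p>The same toy setup works for the row-parallel case; here the AllReduce is simulated as an elementwise sum across the per-device partial results (all values below are illustrative):</p>

```python
# Toy sketch of a row-parallel linear layer with an AllReduce (sum).
# W is split by rows; X is split along its inner (feature) dimension.

def matmul(X, W):
    return [[sum(X[i][p] * W[p][j] for p in range(len(W)))
             for j in range(len(W[0]))] for i in range(len(X))]

W = [[1, 0], [0, 1], [2, 1], [1, 2]]        # full weight (4 x 2)
X = [[1, 2, 3, 4]]                          # full input (1 x 4)

n = 2
rows = len(W) // n
W_shards = [W[d * rows:(d + 1) * rows] for d in range(n)]           # row slices
X_shards = [[row[d * rows:(d + 1) * rows] for row in X] for d in range(n)]

partials = [matmul(X_shards[d], W_shards[d]) for d in range(n)]

# AllReduce: elementwise sum of the partial results across devices.
Y = [[sum(p[i][j] for p in partials) for j in range(len(partials[0][0]))]
     for i in range(len(X))]
assert Y == matmul(X, W)
```

<p>Note the input partitioning here is exactly the output partitioning a column-parallel layer produces, which is why the two compose so cheaply.</p>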
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/RowParallelLinear_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="tensor-parallelism-in-the-mlp-block" class="level1">
<h1>Tensor Parallelism in the MLP Block</h1>
<audio controls="" class="section-audio">
<source src="audio/mlp_tensor_parallel.wav" type="audio/wav">
</audio>
<p>In a Transformer’s <strong>MLP (Feed-Forward)</strong> block, there are typically two linear layers with a non-linearity (e.g., GeLU) in between. Tensor parallelism applies column parallelism to the first linear layer and row parallelism to the second. The first layer splits its output features across devices, and the GeLU activation is applied locally on each device. The second layer then takes these partitioned activations as input and performs a row-parallel computation, finishing with an AllReduce to synchronize the output. This design requires only <strong>one AllReduce per MLP block</strong> in the forward pass, keeping communication overhead minimal.</p>
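<p>Putting the two layer types together gives the tensor-parallel MLP. The sketch below uses tiny, made-up weights; the key property it demonstrates is that GeLU is elementwise, so it can be applied locally to each device's slice, and only one AllReduce (the final sum) is needed:</p>

```python
import math

# Toy sketch of a tensor-parallel MLP forward pass (illustrative shapes):
# column-parallel first linear, local GeLU, row-parallel second linear,
# and a single AllReduce (sum) at the end.

def matmul(X, W):
    return [[sum(X[i][p] * W[p][j] for p in range(len(W)))
             for j in range(len(W[0]))] for i in range(len(X))]

def gelu(M):
    return [[0.5 * v * (1 + math.erf(v / math.sqrt(2))) for v in row]
            for row in M]

X  = [[1.0, -1.0]]
W1 = [[0.5, 1.0, -1.0, 2.0], [1.0, 0.5, 2.0, -1.0]]    # (2 x 4), column-split
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # (4 x 2), row-split

n = 2
W1_shards = [[row[d * 2:(d + 1) * 2] for row in W1] for d in range(n)]
W2_shards = [W2[d * 2:(d + 1) * 2] for d in range(n)]

# Each device: column-parallel matmul, local GeLU, row-parallel matmul.
partials = [matmul(gelu(matmul(X, W1_shards[d])), W2_shards[d])
            for d in range(n)]

# The only AllReduce of the block: sum the partial outputs.
Y = [[sum(p[i][j] for p in partials) for j in range(2)]
     for i in range(len(X))]

# Reference: the same MLP computed without any sharding.
Y_ref = matmul(gelu(matmul(X, W1)), W2)
assert all(abs(Y[0][j] - Y_ref[0][j]) < 1e-9 for j in range(2))
```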
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/MLPTensorParallel_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="tensor-parallelism-in-the-attention-block" class="level1">
<h1>Tensor Parallelism in the Attention Block</h1>
<audio controls="" class="section-audio">
<source src="audio/attention_tensor_parallel.wav" type="audio/wav">
</audio>
<p>The multi-head attention mechanism is naturally suited for tensor parallelism because attention heads are <strong>independent</strong> computations. Each device is assigned a subset of the attention heads. The Query, Key, and Value projection matrices are split column-wise so that each device computes projections for its assigned heads. After computing attention independently, the output projection is applied as a row-parallel linear layer, with an AllReduce to combine the results. Just like the MLP block, this requires only <strong>one AllReduce per attention block</strong> in the forward pass.</p>
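<p>A toy illustration of head independence follows. The per-head Q, K, and V arrays below are assumed to be already-projected activations (in a real model each device would produce them from column-split projection matrices); each "device" runs its head with no communication, and the output projection would then be row-parallel as in the MLP block:</p>

```python
import math

# Toy sketch of head-parallel attention: 2 heads on 2 "devices", each head
# computed independently. Shapes and values are illustrative only.

def matmul(A, B):
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d = len(Q[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])   # Q K^T
    probs = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(probs, V)

# Per-head activations, indexed as [head][position][dim].
Q = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
K = [[[1.0, 1.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 1.0]]]
V = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]

# Each device computes attention for its head with no communication at all.
head_outputs = [attention(Q[h], K[h], V[h]) for h in range(2)]

# Concatenating head outputs reproduces standard multi-head attention input
# to the (row-parallel) output projection.
combined = [head_outputs[0][t] + head_outputs[1][t] for t in range(2)]
assert len(combined) == 2 and len(combined[0]) == 4
```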
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/AttentionTensorParallel_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="communication-the-allreduce-cost" class="level1">
<h1>Communication: The AllReduce Cost</h1>
<audio controls="" class="section-audio">
<source src="audio/allreduce_cost.wav" type="audio/wav">
</audio>
<p>Tensor parallelism relies on <strong>AllReduce</strong> operations to synchronize partial results across devices. In each transformer layer, there are two AllReduce operations in the forward pass (one for the attention block and one for the MLP block) and two in the backward pass. Unlike pipeline parallelism, which only communicates between adjacent stages, tensor parallelism requires <strong>collective communication among all participating devices</strong> within every layer. This makes tensor parallelism most effective when devices are connected via high-bandwidth interconnects (e.g., NVLink within a single node), as communication latency directly impacts throughput.</p>
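<p>A rough back-of-the-envelope estimate shows why bandwidth matters so much. The dimensions below are assumptions for illustration, not figures from this post; each AllReduce moves roughly one activation tensor of size batch × sequence × hidden:</p>

```python
# Back-of-the-envelope AllReduce volume per transformer layer.
# All dimensions here are assumed example values.

batch, seq, hidden = 8, 4096, 8192     # illustrative training configuration
bytes_per_elem = 2                     # bf16 activations
tensor_bytes = batch * seq * hidden * bytes_per_elem
allreduces_per_layer = 4               # attention + MLP, forward + backward

gb = allreduces_per_layer * tensor_bytes / 1e9
print(f"~{gb:.1f} GB of AllReduce traffic per layer per step")
```

<p>At this (assumed) scale, every transformer layer moves on the order of gigabytes per step, which is tolerable over NVLink but quickly dominates over slower inter-node links.</p>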
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/AllReduceVisualization_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="tensor-vs-pipeline-vs-data-parallelism" class="level1">
<h1>Tensor vs Pipeline vs Data Parallelism</h1>
<audio controls="" class="section-audio">
<source src="audio/tensor_vs_pipeline_vs_data.wav" type="audio/wav">
</audio>
<p>Each parallelism strategy operates at a different granularity and serves a different purpose. In practice, state-of-the-art training systems like Megatron-LM combine all three in a <strong>3D parallelism</strong> configuration: tensor parallelism within a node (leveraging fast NVLink), pipeline parallelism across nodes, and data parallelism across pipeline replicas.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 19%">
<col style="width: 25%">
<col style="width: 26%">
<col style="width: 27%">
</colgroup>
<thead>
<tr class="header">
<th>Feature</th>
<th>Data Parallelism</th>
<th>Pipeline Parallelism</th>
<th>Tensor Parallelism</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>What is split</td>
<td>Data (batches)</td>
<td>Model (layers)</td>
<td>Model (tensors within layers)</td>
</tr>
<tr class="even">
<td>Granularity</td>
<td>Coarse</td>
<td>Medium</td>
<td>Fine</td>
</tr>
<tr class="odd">
<td>Model per device</td>
<td>Full copy (or sharded/FSDP)</td>
<td>Subset of layers</td>
<td>Subset of each layer</td>
</tr>
<tr class="even">
<td>Communication</td>
<td>AllReduce (gradients)</td>
<td>Activations between stages</td>
<td>AllReduce (activations)</td>
</tr>
<tr class="odd">
<td>Best interconnect</td>
<td>Any</td>
<td>Moderate bandwidth</td>
<td>High bandwidth (NVLink)</td>
</tr>
<tr class="even">
<td>Idle time</td>
<td>Minimal</td>
<td>Pipeline bubble</td>
<td>Minimal</td>
</tr>
<tr class="odd">
<td>Primary benefit</td>
<td>Faster training</td>
<td>Enables larger models</td>
<td>Parallelizes single layers</td>
</tr>
<tr class="even">
<td>Typical scope</td>
<td>Across nodes</td>
<td>Across nodes</td>
<td>Within a node</td>
</tr>
</tbody>
</table>
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/TensorVsPipelineVsData_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="summary" class="level1">
<h1>Summary</h1>
<audio controls="" class="section-audio">
<source src="audio/summary.wav" type="audio/wav">
</audio>
<p>Tensor Parallelism splits individual weight matrices across devices, enabling parallelism at the finest granularity. Column parallel layers split output features, row parallel layers split input features, and the two are paired together within Transformer MLP and attention blocks to minimize communication. Each transformer block requires only two AllReduce operations in the forward pass. Because of its high communication requirements, tensor parallelism works best within a single node with fast interconnects. Combined with pipeline and data parallelism in a 3D parallelism setup, it is essential for training today’s largest language models.</p>
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/TensorParallelismSummary_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="references" class="level1">
<h1>References</h1>
<ul>
<li>Shoeybi, M., et al.&nbsp;(2019). <a href="https://arxiv.org/abs/1909.08053">Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism</a>. <em>arXiv:1909.08053</em></li>
<li>Narayanan, D., et al.&nbsp;(2021). <a href="https://arxiv.org/abs/2104.04473">Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM</a>. <em>arXiv:2104.04473</em></li>
<li>Korthikanti, V., et al.&nbsp;(2022). <a href="https://arxiv.org/abs/2205.05198">Reducing Activation Recomputation in Large Transformer Models</a>. <em>arXiv:2205.05198</em></li>
<li><a href="https://github.com/ManimCommunity/manim">Manim Community Edition</a> — Animation engine used for the visuals in this post</li>
<li><a href="https://github.com/hexgrad/kokoro">Kokoro TTS</a> — Text-to-speech model used to generate the audio narration</li>
</ul>


</section>

 ]]></description>
  <guid>https://rohanrajput04.github.io/posts/tensor-parallelism/</guid>
  <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Parallelism in AI - Part 2: Pipeline Parallelism</title>
  <dc:creator>Rohan Rajput</dc:creator>
  <link>https://rohanrajput04.github.io/posts/pipeline-parallelism/</link>
  <description><![CDATA[ 





<section id="what-is-pipeline-parallelism" class="level1">
<h1>What is Pipeline Parallelism?</h1>
<audio controls="" class="section-audio">
<source src="audio/what_is_pipeline_parallelism.wav" type="audio/wav">
</audio>
<p>In <a href="../../posts/data-parallelism/index.html">Part 1</a>, we saw how Data Parallelism replicates the entire model across devices and splits the data. But what happens when a model is too large to fit on a single device, even with FSDP? Pipeline Parallelism takes a different approach. Instead of replicating the model, it partitions the model itself across multiple devices. Each device holds a subset of the model’s layers (called a <strong>stage</strong>), and data flows through the stages sequentially, much like an assembly line in a factory.</p>
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/PipelineParallelismIntro_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="how-pipeline-parallelism-works" class="level1">
<h1>How Pipeline Parallelism Works</h1>
<audio controls="" class="section-audio">
<source src="audio/how_pipeline_parallelism_works.wav" type="audio/wav">
</audio>
<p>In pipeline parallelism, the model is split into consecutive groups of layers, and each group is assigned to a different device. During the forward pass, each device processes its layers and sends the activations to the next device. During the backward pass, gradients flow in the reverse direction. This allows us to train models that are too large for a single device’s memory, since each device only needs to store a fraction of the total parameters.</p>
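<p>The partitioning idea can be sketched in a few lines of plain Python. The "layers" below are hypothetical toy functions; a real implementation would additionally transfer activations between devices:</p>

```python
# Minimal sketch of partitioning a model's layers into pipeline stages and
# running a sequential forward pass. The "layers" are toy stand-ins.

layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]

def make_stages(layers, n_stages):
    """Split consecutive layers into n_stages equal groups."""
    per = len(layers) // n_stages
    return [layers[i * per:(i + 1) * per] for i in range(n_stages)]

def forward(stages, x):
    for stage in stages:        # in practice: each stage on its own device
        for layer in stage:
            x = layer(x)        # activations flow on to the next stage
    return x

stages = make_stages(layers, 2)
assert forward(stages, 3) == ((3 + 1) * 2 - 3) ** 2
```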
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/PipelineParallelismDetailed_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="naive-pipeline-parallelism" class="level1">
<h1>Naive Pipeline Parallelism</h1>
<audio controls="" class="section-audio">
<source src="audio/naive_pipeline_parallelism.wav" type="audio/wav">
</audio>
<p>The simplest form of pipeline parallelism processes one mini-batch at a time through all stages sequentially. While straightforward, this approach has a major drawback: at any given time, only one device is actively computing while all others sit idle. This means device utilization is roughly 1/N, where N is the number of stages. The idle time wasted across devices is known as the <strong>pipeline bubble</strong>.</p>
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/NaivePipelineParallelism_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="the-bubble-problem" class="level1">
<h1>The Bubble Problem</h1>
<audio controls="" class="section-audio">
<source src="audio/bubble_problem.wav" type="audio/wav">
</audio>
<p>The pipeline bubble is the key inefficiency of naive pipeline parallelism. If we have 4 stages and it takes time <em>t</em> for each stage to process a mini-batch, then during the forward pass, stage 1 finishes at <em>t</em> but stage 4 doesn’t start until <em>3t</em>. The total idle time across all devices grows linearly with the number of stages. Reducing this bubble is the primary goal of more advanced pipeline scheduling strategies.</p>
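<p>The idle-time arithmetic from the example above can be written out directly: with N stages each taking time <em>t</em> per mini-batch, the forward pass lasts N·t, but each device computes for only t:</p>

```python
# Idle-time arithmetic for the naive pipeline schedule.

def naive_forward_stats(n_stages, t=1.0):
    total_time = n_stages * t            # mini-batch traverses all stages
    busy_per_device = t                  # each device works exactly once
    idle_per_device = total_time - busy_per_device
    utilization = busy_per_device / total_time
    return idle_per_device, utilization

idle, util = naive_forward_stats(4)
assert idle == 3.0 and util == 0.25      # stage 4 starts only at 3t
```

<p>Utilization is 1/N, so adding stages to a naive pipeline makes the waste strictly worse.</p>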
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/PipelineBubbleAnalysis_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="micro-batching-gpipe" class="level1">
<h1>Micro-batching (GPipe)</h1>
<audio controls="" class="section-audio">
<source src="audio/gpipe_microbatching.wav" type="audio/wav">
</audio>
<p>GPipe addresses the bubble problem by splitting each mini-batch into smaller <strong>micro-batches</strong>. Instead of waiting for an entire mini-batch to pass through all stages, GPipe injects multiple micro-batches into the pipeline in quick succession. This way, while stage 2 is processing micro-batch 1, stage 1 can already start on micro-batch 2. The more micro-batches we use, the smaller the bubble becomes relative to the total computation. Gradients are accumulated across all micro-batches and synchronized at the end.</p>
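<p>The intuition can be quantified with the standard GPipe bubble estimate: with p stages and m micro-batches, the fraction of time lost to the bubble is roughly (p − 1) / (m + p − 1):</p>

```python
# Standard GPipe bubble-fraction estimate: (p - 1) / (m + p - 1),
# where p = pipeline stages and m = micro-batches per mini-batch.

def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

assert bubble_fraction(4, 1) == 0.75      # naive case: 1 micro-batch, 3/4 idle
assert bubble_fraction(4, 8) == 3 / 11    # 8 micro-batches: ~27% bubble
assert bubble_fraction(4, 32) < 0.09      # bubble keeps shrinking with m
```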
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/GPipeSchedule_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="f1b-schedule-pipedream" class="level1">
<h1>1F1B Schedule (PipeDream)</h1>
<audio controls="" class="section-audio">
<source src="audio/one_f_one_b_schedule.wav" type="audio/wav">
</audio>
<p>The <strong>1F1B (One Forward, One Backward)</strong> schedule, introduced by PipeDream, further improves pipeline efficiency. After an initial warm-up phase where forward passes fill the pipeline, each device alternates between one forward pass and one backward pass. This interleaved scheduling reduces peak memory usage compared to GPipe, since devices don’t need to store activations for all micro-batches simultaneously, while maintaining similar pipeline utilization.</p>
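<p>The memory benefit can be summarized with a deliberate simplification: GPipe holds activations for all m micro-batches until the backward phase begins, while 1F1B caps in-flight micro-batches per stage at roughly the number of stages p, independent of m:</p>

```python
# Simplified peak-activation comparison between GPipe and 1F1B.
# This ignores per-stage variation and is meant only to show the scaling.

def peak_activations(schedule, p, m):
    if schedule == "gpipe":
        return m            # all micro-batches' activations live at once
    if schedule == "1f1b":
        return min(p, m)    # warm-up fills at most p forwards per stage
    raise ValueError(schedule)

assert peak_activations("gpipe", p=4, m=16) == 16
assert peak_activations("1f1b", p=4, m=16) == 4   # no longer grows with m
```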
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/OneFOneBSchedule_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="pipeline-parallelism-vs-data-parallelism" class="level1">
<h1>Pipeline Parallelism vs Data Parallelism</h1>
<audio controls="" class="section-audio">
<source src="audio/pipeline_vs_data_parallelism.wav" type="audio/wav">
</audio>
<p>Pipeline parallelism and data parallelism solve different problems and are often used together. Data parallelism replicates the model and splits the data, which is ideal when the model fits on a single device but you want faster training. Pipeline parallelism splits the model across devices, which is necessary when the model is too large for one device. In practice, large-scale training combines both: the model is partitioned across pipeline stages, and each stage is replicated across multiple devices using data parallelism.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 27%">
<col style="width: 35%">
<col style="width: 37%">
</colgroup>
<thead>
<tr class="header">
<th>Feature</th>
<th>Data Parallelism</th>
<th>Pipeline Parallelism</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>What is split</td>
<td>Data (batches)</td>
<td>Model (layers)</td>
</tr>
<tr class="even">
<td>Model per device</td>
<td>Full copy</td>
<td>Subset of layers</td>
</tr>
<tr class="odd">
<td>Primary benefit</td>
<td>Faster training</td>
<td>Enables larger models</td>
</tr>
<tr class="even">
<td>Communication</td>
<td>AllReduce (gradients)</td>
<td>Activations between stages</td>
</tr>
<tr class="odd">
<td>Idle time</td>
<td>Minimal</td>
<td>Pipeline bubble</td>
</tr>
<tr class="even">
<td>Memory per device</td>
<td>Full model + optimizer</td>
<td>Partial model + optimizer</td>
</tr>
<tr class="odd">
<td>Scalability</td>
<td>Limited by model size</td>
<td>Limited by number of layers</td>
</tr>
</tbody>
</table>
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/PipelineVsDataParallelism_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="summary" class="level1">
<h1>Summary</h1>
<audio controls="" class="section-audio">
<source src="audio/summary.wav" type="audio/wav">
</audio>
<p>Pipeline Parallelism enables training of models too large for a single device by partitioning layers across multiple devices. The naive approach suffers from the pipeline bubble, where devices sit idle waiting for data to flow through the pipeline. GPipe reduces this bubble through micro-batching, and PipeDream’s 1F1B schedule further optimizes memory usage with interleaved forward and backward passes. Combined with data parallelism, pipeline parallelism is a core building block for training today’s largest AI models.</p>
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/PipelineParallelismSummary_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="references" class="level1">
<h1>References</h1>
<ul>
<li>Huang, Y., et al.&nbsp;(2019). <a href="https://arxiv.org/abs/1811.06965">GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism</a>. <em>NeurIPS 2019</em></li>
<li>Narayanan, D., et al.&nbsp;(2019). <a href="https://arxiv.org/abs/1806.03377">PipeDream: Generalized Pipeline Parallelism for DNN Training</a>. <em>SOSP 2019</em></li>
<li>Narayanan, D., et al.&nbsp;(2021). <a href="https://arxiv.org/abs/2104.04473">Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM</a>. <em>arXiv:2104.04473</em></li>
<li><a href="https://github.com/ManimCommunity/manim">Manim Community Edition</a> — Animation engine used for the visuals in this post</li>
</ul>


</section>

 ]]></description>
  <guid>https://rohanrajput04.github.io/posts/pipeline-parallelism/</guid>
  <pubDate>Sun, 08 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Parallelism in AI - Part 1: Data Parallelism &amp; FSDP</title>
  <dc:creator>Rohan Singh Rajput</dc:creator>
  <link>https://rohanrajput04.github.io/posts/data-parallelism/</link>
  <description><![CDATA[ 





<section id="why-do-we-need-parallelism-in-ai" class="level1">
<h1>Why do we need parallelism in AI?</h1>
<audio controls="" class="section-audio">
<source src="audio/what_is_parallelism.wav" type="audio/wav">
</audio>
<p>One of the core problems with modern neural-network-based AI is that it requires a massive amount of computation during training and inference. To perform these floating-point operations (FLOPs), we rely on specialized hardware such as GPUs, TPUs, and NPUs, which can apply a single instruction to many data elements simultaneously. However, the compute and memory limits of a single device make training large AI models infeasible on its own. Hence, we leverage multiple devices to speed up the training process and to handle large models.</p>
</section>
<section id="introduction-to-data-parallelism" class="level1">
<h1>Introduction to Data Parallelism</h1>
<audio controls="" class="section-audio">
<source src="audio/intro_data_parallelism.wav" type="audio/wav">
</audio>
<p>Many parallelism techniques exist for training and inference of AI models. In this section, we focus on Data Parallelism, a technique that is specifically useful during model training. As the name suggests, we shard (or divide) our data into smaller batches, and each batch is processed in parallel on a different device. Compared to single-device training, where batches are processed one at a time, this lets us use multiple devices to speed up model training.</p>
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/DataParallelismIntro_ManimCE_v0.19.2.gif" class="img-fluid"></p>
<audio controls="" class="section-audio">
<source src="audio/data_parallelism_detail.wav" type="audio/wav">
</audio>
<p>In data parallelism, each device maintains a copy of the model parameters. During training, each device processes a different subset of the training data and computes the gradients independently. After computing the gradients, an AllReduce operation is performed to aggregate the gradients across all devices. This ensures that all devices have the same updated model parameters for the next iteration.</p>
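<p>One data-parallel step can be simulated in a few lines. The model here is a deliberately trivial one-parameter least-squares fit (a made-up example), but the synchronization pattern is the real one: local gradients, an averaging AllReduce, then an identical update everywhere:</p>

```python
# Minimal simulation of one data-parallel training step across two "devices".

def grad(w, x, y):
    """Gradient of the squared error 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

w = [1.0, 1.0]                        # identical parameter copy per device
shards = [(2.0, 5.0), (3.0, 3.0)]     # each device sees different (x, y) data

grads = [grad(w[d], *shards[d]) for d in range(2)]   # local gradients
g = sum(grads) / len(grads)                          # AllReduce (average)

lr = 0.1
w = [wd - lr * g for wd in w]                        # same update on each device
assert w[0] == w[1]                                  # replicas stay in sync
```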
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/DataParallelismDetailed_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="what-is-allreduce" class="level1">
<h1>What is AllReduce?</h1>
<audio controls="" class="section-audio">
<source src="audio/allreduce_explained.wav" type="audio/wav">
</audio>
<p>We can think of AllReduce as a communication operation that takes the gradients computed by each device and combines them (e.g., by summing) across all devices. This allows each device to have the same updated gradients, which are then used to update the model parameters. AllReduce is a critical component of data parallelism, as it ensures that all devices stay in sync during training.</p>
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/AllReduceExplained_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="lets-talk-about-fsdp" class="level1">
<h1>Let’s talk about FSDP</h1>
<audio controls="" class="section-audio">
<source src="audio/what_is_fsdp.wav" type="audio/wav">
</audio>
<p>Fully Sharded Data Parallel (FSDP) is an advanced parallelism strategy that goes beyond traditional data parallelism. In FSDP, the model parameters are sharded (i.e., divided) across multiple devices, rather than each device maintaining a full copy of the model. This allows for training larger models that may not fit into the memory of a single device. FSDP also incorporates techniques to efficiently manage communication and synchronization between devices, making it a powerful tool for training large-scale AI models.</p>
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/FSDPExplained_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="fsdp-workflow" class="level1">
<h1>FSDP Workflow</h1>
<audio controls="" class="section-audio">
<source src="audio/fsdp_workflow.wav" type="audio/wav">
</audio>
<p>In the FSDP workflow, the model parameters are sharded across multiple devices. Before a layer’s forward (and backward) computation, each device temporarily <strong>AllGathers</strong> the full parameters for that layer, computes with them, and then frees the gathered copies. After the backward pass, gradients are combined with a <strong>ReduceScatter</strong>, so each device retains only the gradient shard corresponding to the parameters it owns. This allows efficient training of large models while keeping per-device memory bounded.</p>
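<p>A conceptual, plain-Python sketch of FSDP's two core collectives, AllGather (reconstruct full parameters before compute) and ReduceScatter (leave each device only its gradient shard). Shapes and values are toy:</p>

```python
# Conceptual sketch of FSDP's AllGather / ReduceScatter flow.

n = 2
shards = [[1.0, 2.0], [3.0, 4.0]]             # device d owns shards[d]

def all_gather(shards):
    """Every device reconstructs the full flat parameter vector."""
    return [p for s in shards for p in s]

def reduce_scatter(full_grads, n):
    """Sum per-device gradients, then give each device only its slice."""
    summed = [sum(g[i] for g in full_grads)
              for i in range(len(full_grads[0]))]
    per = len(summed) // n
    return [summed[d * per:(d + 1) * per] for d in range(n)]

params = all_gather(shards)                    # before compute: full params
full_grads = [[0.5] * 4, [1.5] * 4]            # each device's full gradient
grad_shards = reduce_scatter(full_grads, n)    # after backward: shard only

assert params == [1.0, 2.0, 3.0, 4.0]
assert grad_shards == [[2.0, 2.0], [2.0, 2.0]]
```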
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/FSDPWorkflow_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="fsdp-vs-data-parallel" class="level1">
<h1>FSDP vs Data Parallel</h1>
<audio controls="" class="section-audio">
<source src="audio/fsdp_vs_data_parallel.wav" type="audio/wav">
</audio>
<p>The difference between FSDP and traditional data parallelism lies in how the model parameters are managed. In data parallelism, each device maintains a full copy of the model parameters, which can lead to memory constraints when training large models. In contrast, FSDP shards the model parameters across multiple devices, allowing for larger models to be trained without running into memory issues. Additionally, FSDP incorporates more efficient communication strategies to manage synchronization between devices, making it a more scalable solution for training large-scale AI models.</p>
<section id="example" class="level2">
<h2 class="anchored" data-anchor-id="example">Example</h2>
<p>Suppose we have a 7-billion-parameter model that we want to train on a 4-GPU node. In DP, we place a full copy of the 7B model on each GPU. In FSDP, each device holds only one quarter of the model parameters.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 26%">
<col style="width: 35%">
<col style="width: 37%">
</colgroup>
<thead>
<tr class="header">
<th>Feature</th>
<th>Data Parallelism (DP)</th>
<th>FSDP</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Model storage</td>
<td>Full copy on every GPU</td>
<td>Sharded across GPUs</td>
</tr>
<tr class="even">
<td>Params per GPU</td>
<td>7B (all)</td>
<td>1.75B (1/4th)</td>
</tr>
<tr class="odd">
<td>Memory per GPU</td>
<td>~42 GB</td>
<td>~10.5 GB</td>
</tr>
<tr class="even">
<td>Fits on 40GB A100?</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr class="odd">
<td>Communication</td>
<td>AllReduce (gradients only)</td>
<td>AllGather + ReduceScatter</td>
</tr>
<tr class="even">
<td>Communication cost</td>
<td>Lower</td>
<td>Higher (overlapped with compute)</td>
</tr>
<tr class="odd">
<td>Complexity</td>
<td>Simple</td>
<td>More complex</td>
</tr>
<tr class="even">
<td>Scalability</td>
<td>Limited by GPU memory</td>
<td>Scales to much larger models</td>
</tr>
</tbody>
</table>
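<p>The memory figures in the table follow from a simple assumption of roughly 6 bytes of state per parameter (the exact breakdown depends on the precision recipe and optimizer; 6 bytes is the value that reproduces the ~42 GB figure above, not a universal constant):</p>

```python
# Arithmetic behind the table's memory figures.
# bytes_per_param = 6 is an assumption chosen to match the ~42 GB number;
# real totals depend on precision and optimizer choices.

params = 7e9
bytes_per_param = 6
n_gpus = 4

dp_per_gpu = params * bytes_per_param / 1e9    # full replica per GPU (GB)
fsdp_per_gpu = dp_per_gpu / n_gpus             # 1/4 of the states (GB)

assert dp_per_gpu == 42.0      # does not fit on a 40 GB A100
assert fsdp_per_gpu == 10.5    # fits comfortably
```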
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/FSDPvsDataParallel_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
</section>
<section id="fsdp-summary" class="level1">
<h1>FSDP Summary</h1>
<audio controls="" class="section-audio">
<source src="audio/fsdp_summary.wav" type="audio/wav">
</audio>
<p>In summary, Fully Sharded Data Parallel (FSDP) enables training of models too large for traditional data parallelism by sharding parameters across devices. Its extra communication (AllGather and ReduceScatter instead of a single AllReduce) is largely overlapped with computation, so the memory savings come at a modest throughput cost. For scaling model size on a fixed set of GPUs, FSDP is one of the most effective tools available.</p>
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/FSDPSummary_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="references" class="level1">
<h1>References</h1>
<ul>
<li>Li, M., et al.&nbsp;(2020). <a href="https://arxiv.org/abs/2006.15704">PyTorch Distributed: Experiences on Accelerating Data Parallel Training</a>. <em>arXiv:2006.15704</em></li>
<li>Zhao, Y., et al.&nbsp;(2023). <a href="https://arxiv.org/abs/2304.11277">PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel</a>. <em>arXiv:2304.11277</em></li>
<li><a href="https://github.com/ManimCommunity/manim">Manim Community Edition</a> — Animation engine used for the visuals in this post</li>
<li><a href="https://github.com/hexgrad/kokoro">Kokoro TTS</a> — Text-to-speech model used to generate the audio narration</li>
</ul>


</section>

 ]]></description>
  <guid>https://rohanrajput04.github.io/posts/data-parallelism/</guid>
  <pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Welcome to AI With Rohan</title>
  <dc:creator>Rohan Rajput</dc:creator>
  <link>https://rohanrajput04.github.io/posts/welcome-post/</link>
  <description><![CDATA[ 





<section id="welcome" class="level1">
<h1>Welcome!</h1>
<p>This is my first blog post on this new site built with Quarto.</p>
<section id="what-to-expect" class="level2">
<h2 class="anchored" data-anchor-id="what-to-expect">What to expect</h2>
<p>I’ll be sharing:</p>
<ul>
<li>Programming tutorials</li>
<li>Technology insights</li>
<li>Personal projects</li>
<li>Thoughts on software development</li>
</ul>
<p>Stay tuned for more content!</p>


</section>
</section>

 ]]></description>
  <category>general</category>
  <guid>https://rohanrajput04.github.io/posts/welcome-post/</guid>
  <pubDate>Fri, 13 Sep 2024 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
