Parallelism in AI - Part 2: Pipeline Parallelism
What is Pipeline Parallelism?
In Part 1, we saw how Data Parallelism replicates the entire model across devices and splits the data. But what happens when a model is too large to fit on a single device, even with FSDP? Pipeline Parallelism takes a different approach. Instead of replicating the model, it partitions the model itself across multiple devices. Each device holds a subset of the model’s layers (called a stage), and data flows through the stages sequentially, much like an assembly line in a factory.

How Pipeline Parallelism Works
In pipeline parallelism, the model is split into consecutive groups of layers, and each group is assigned to a different device. During the forward pass, each device processes its layers and sends the activations to the next device. During the backward pass, gradients flow in the reverse direction. This allows us to train models that are too large for a single device’s memory, since each device only needs to store a fraction of the total parameters.
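The layer-partitioning idea above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not any framework's API: the layers are toy callables, and the hand-off between stages (which in a real system is a device-to-device send of activations) is just a loop.

```python
# Minimal sketch: partition a model's layers into contiguous stages and
# run a forward pass by handing activations from one stage to the next.
# All names here are illustrative, not from any framework.

def partition_layers(layers, num_stages):
    """Split a list of layers into num_stages contiguous groups."""
    per_stage = len(layers) // num_stages
    return [layers[i * per_stage:(i + 1) * per_stage]
            for i in range(num_stages)]

def pipeline_forward(stages, x):
    """Run the input through each stage in order; in a real system each
    hand-off between stages is a cross-device send of activations."""
    for stage in stages:
        for layer in stage:
            x = layer(x)
    return x

# Toy "model": 8 layers, each doubling its input.
layers = [lambda v: v * 2 for _ in range(8)]
stages = partition_layers(layers, num_stages=4)  # 2 layers per stage
print(pipeline_forward(stages, 1))  # 1 * 2**8 = 256
```

Each device would hold one entry of `stages`, which is why memory per device shrinks as stages are added.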

Naive Pipeline Parallelism
The simplest form of pipeline parallelism processes one mini-batch at a time through all stages sequentially. While straightforward, this approach has a major drawback: at any given time, only one device is actively computing while all others sit idle. This means device utilization is roughly 1/N, where N is the number of stages. The idle time wasted across devices is known as the pipeline bubble.
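The 1/N utilization claim can be checked with a toy timeline, assuming each stage takes one time unit for its forward pass (a simplified counting model, not a real scheduler):

```python
# Naive schedule timeline: one mini-batch flows through N stages one at
# a time, so exactly one device is busy at each time step.

def naive_forward_schedule(num_stages):
    """Return, per device, the set of time steps during which it computes.
    Stage s (0-indexed) is busy only at step s."""
    return {stage: {stage} for stage in range(num_stages)}

def utilization(schedule, num_stages):
    total_steps = num_stages  # the whole forward pass spans N steps
    busy_steps = sum(len(steps) for steps in schedule.values())
    return busy_steps / (num_stages * total_steps)

print(utilization(naive_forward_schedule(4), 4))  # 1/4 = 0.25
print(utilization(naive_forward_schedule(8), 8))  # 1/8 = 0.125
```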

The Bubble Problem
The pipeline bubble is the key inefficiency of naive pipeline parallelism. If we have 4 stages and each stage takes time t to process a mini-batch, then during the forward pass, stage 1 finishes at t but stage 4 doesn’t start until 3t. Each device’s idle time grows linearly with the number of stages, so the total idle time summed across all devices grows quadratically. Reducing this bubble is the primary goal of more advanced pipeline scheduling strategies.
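The timing numbers in this paragraph are easy to verify with a little arithmetic, again assuming every stage takes the same time t per mini-batch:

```python
# Forward-pass timing with N stages, each taking t time units:
# stage k (1-indexed) starts at (k - 1) * t, and each device is busy
# for only t out of the N * t wall-clock span.

def forward_timing(num_stages, t=1.0):
    start_times = [(k - 1) * t for k in range(1, num_stages + 1)]
    span = num_stages * t            # wall-clock time of the forward pass
    idle_per_device = span - t       # each device computes for only t
    total_idle = num_stages * idle_per_device  # = N * (N - 1) * t
    return start_times, idle_per_device, total_idle

starts, idle, total_idle = forward_timing(4)
print(starts)      # [0.0, 1.0, 2.0, 3.0] -> stage 4 starts at 3t
print(idle)        # 3.0
print(total_idle)  # 12.0
```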

Micro-batching (GPipe)
GPipe addresses the bubble problem by splitting each mini-batch into smaller micro-batches. Instead of waiting for an entire mini-batch to pass through all stages, GPipe injects multiple micro-batches into the pipeline in quick succession. This way, while stage 2 is processing micro-batch 1, stage 1 can already start on micro-batch 2. The more micro-batches we use, the smaller the bubble becomes relative to the total computation. Gradients are accumulated across all micro-batches and synchronized at the end.
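The effect of micro-batching on the bubble can be quantified with a simple counting model: with N stages and M micro-batches, each taking one time unit per stage, the pipeline needs M + N − 1 steps, of which N − 1 are fill/drain overhead.

```python
# GPipe's bubble shrinks as the micro-batch count grows: the fraction of
# time lost to fill/drain is (N - 1) / (M + N - 1) under equal step times.

def bubble_fraction(num_stages, num_microbatches):
    total_steps = num_microbatches + num_stages - 1
    return (num_stages - 1) / total_steps

for m in (1, 4, 16, 64):
    print(m, round(bubble_fraction(4, m), 3))
# m = 1 recovers the naive schedule's 0.75; by m = 64 the bubble
# is under 5% of total time.
```

This is why GPipe's efficiency improves with more micro-batches, though in practice very small micro-batches eventually under-utilize each device's compute.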

1F1B Schedule (PipeDream)
The 1F1B (One Forward, One Backward) schedule, introduced by PipeDream, further improves pipeline efficiency. After an initial warm-up phase in which forward passes fill the pipeline, each device alternates between one forward pass and one backward pass. This interleaving maintains pipeline utilization similar to GPipe’s while reducing peak memory usage: because backward passes begin early and free their activations, a device never needs to hold activations for all micro-batches at once.
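The memory difference can be seen with a simplified counting model (it ignores scheduler details, but captures the standard result): GPipe runs all M forward micro-batches before any backward, so every stage may hold M sets of activations, while under 1F1B stage i (0-indexed) performs N − i warm-up forwards and then alternates, bounding its in-flight activations by the pipeline depth.

```python
# Peak in-flight activations per stage, under a simplified counting model.

def peak_activations(num_stages, num_microbatches, schedule):
    if schedule == "gpipe":
        # All forwards complete before any backward: M activations live.
        return [num_microbatches] * num_stages
    if schedule == "1f1b":
        # Stage i warms up with N - i forwards, then each backward frees
        # one set of activations before the next forward begins.
        return [min(num_microbatches, num_stages - i)
                for i in range(num_stages)]
    raise ValueError(f"unknown schedule: {schedule}")

print(peak_activations(4, 16, "gpipe"))  # [16, 16, 16, 16]
print(peak_activations(4, 16, "1f1b"))   # [4, 3, 2, 1]
```

Note that the peak under 1F1B depends on the number of stages, not the number of micro-batches, which is what lets 1F1B scale M up without growing activation memory.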

Pipeline Parallelism vs Data Parallelism
Pipeline parallelism and data parallelism solve different problems and are often used together. Data parallelism replicates the model and splits the data, which is ideal when the model fits on a single device but you want faster training. Pipeline parallelism splits the model across devices, which is necessary when the model is too large for one device. In practice, large-scale training combines both: the model is partitioned across pipeline stages, and each stage is replicated across multiple devices using data parallelism.
| Feature | Data Parallelism | Pipeline Parallelism |
|---|---|---|
| What is split | Data (batches) | Model (layers) |
| Model per device | Full copy | Subset of layers |
| Primary benefit | Faster training | Enables larger models |
| Communication | AllReduce (gradients) | Activations between stages |
| Idle time | Minimal | Pipeline bubble |
| Memory per device | Full model + optimizer | Partial model + optimizer |
| Scalability | Limited by model size | Limited by number of layers |
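One way to picture the combination of the two is as a 2D device grid: each device gets a (pipeline stage, data-parallel replica) coordinate. The layout convention below is illustrative, not any framework's actual rank assignment; devices sharing a stage form a data-parallel group that all-reduces gradients, and devices sharing a replica form one pipeline.

```python
# Sketch: 8 devices arranged as 4 pipeline stages x 2 data-parallel
# replicas. The rank -> coordinate mapping here is one possible
# convention, chosen for illustration.

def device_grid(world_size, pp_size):
    dp_size = world_size // pp_size
    assert dp_size * pp_size == world_size, "world size must factor evenly"
    # rank -> (pipeline stage, data-parallel replica)
    return {rank: (rank % pp_size, rank // pp_size)
            for rank in range(world_size)}

grid = device_grid(world_size=8, pp_size=4)
print(grid[0], grid[7])  # (0, 0) and (3, 1)
# Data-parallel group for stage 0: all ranks with coordinate (0, _)
stage0_group = [r for r, (s, _) in grid.items() if s == 0]
print(stage0_group)  # [0, 4]
```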

Summary
Pipeline Parallelism enables training of models too large for a single device by partitioning layers across multiple devices. The naive approach suffers from the pipeline bubble, where devices sit idle waiting for data to flow through the pipeline. GPipe reduces this bubble through micro-batching, and PipeDream’s 1F1B schedule further optimizes memory usage with interleaved forward and backward passes. Combined with data parallelism, pipeline parallelism is a core building block for training today’s largest AI models.

References
- Huang, Y., et al. (2019). GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. NeurIPS 2019
- Narayanan, D., et al. (2019). PipeDream: Generalized Pipeline Parallelism for DNN Training. SOSP 2019
- Narayanan, D., et al. (2021). Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. arXiv:2104.04473
- Manim Community Edition, the animation engine used for the visuals in this post