<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>AI With Rohan</title>
<link>https://rohanrajput04.github.io/posts.html</link>
<atom:link href="https://rohanrajput04.github.io/posts.xml" rel="self" type="application/rss+xml"/>
<description>Deep dives into AI, ML, and distributed systems</description>
<generator>quarto-1.8.27</generator>
<lastBuildDate>Wed, 11 Mar 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Parallelism in AI - Part 3: Tensor Parallelism</title>
  <dc:creator>Rohan Rajput</dc:creator>
  <link>https://rohanrajput04.github.io/posts/tensor-parallelism/</link>
  <description><![CDATA[ 





<p><em>This is Part 3 of the <strong>Parallelism in AI</strong> series. In <a href="../../posts/data-parallelism/index.html">Part 1</a>, we covered Data Parallelism &amp; FSDP, and in <a href="../../posts/pipeline-parallelism/index.html">Part 2</a>, we explored Pipeline Parallelism. Now we dive into Tensor Parallelism, which splits individual layers across devices.</em></p>
<section id="what-is-tensor-parallelism" class="level1">
<h1>What is Tensor Parallelism?</h1>
<audio controls="" class="section-audio">
<source src="audio/what_is_tensor_parallelism.wav" type="audio/wav">
</audio>
<p>In the previous parts, we saw how Data Parallelism replicates the model and splits data, and how Pipeline Parallelism partitions the model by layers. Tensor Parallelism takes this a step further: instead of splitting at the layer level, it splits <strong>individual tensors</strong> (weight matrices) within a single layer across multiple devices. Each device computes a portion of a layer’s operation simultaneously, enabling us to parallelize even within a single transformer block. This is especially powerful for very large layers that are too memory-intensive for a single GPU.</p>
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/TensorParallelismIntro_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="column-parallel-linear-layer" class="level1">
<h1>Column Parallel Linear Layer</h1>
<audio controls="" class="section-audio">
<source src="audio/column_parallel_linear.wav" type="audio/wav">
</audio>
<p>The first building block of tensor parallelism is the <strong>Column Parallel Linear</strong> layer. Given a weight matrix <strong>W</strong>, we split it along the column dimension into N partitions, one per device. Each device holds a slice <strong>W_i</strong> and computes <strong>Y_i = X · W_i</strong> independently using the full input <strong>X</strong>. Since the columns are independent, no communication is needed during the forward computation itself: each device produces a partial output corresponding to a subset of the output features. The partial outputs are then concatenated (or consumed directly by the next layer) to form the complete result.</p>
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/ColumnParallelLinear_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="row-parallel-linear-layer" class="level1">
<h1>Row Parallel Linear Layer</h1>
<audio controls="" class="section-audio">
<source src="audio/row_parallel_linear.wav" type="audio/wav">
</audio>
<p>The complement to column parallelism is the <strong>Row Parallel Linear</strong> layer. Here, the weight matrix is split along the row dimension. Each device holds a horizontal slice <strong>W_i</strong> and receives a corresponding partition of the input. Each device computes a partial matrix multiplication, producing a partial result. These partial results are then summed across devices using an <strong>AllReduce</strong> operation to produce the final output. Row parallel layers are typically paired with column parallel layers so that the output partitioning of one naturally feeds into the input partitioning of the other, minimizing communication.</p>
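<p>The same toy setup works for the row-parallel case; here the AllReduce is simulated as an elementwise sum across the per-device partial results (all values below are illustrative):</p>

```python
# Toy sketch of a row-parallel linear layer with an AllReduce (sum).
# W is split by rows; X is split along its inner (feature) dimension.

def matmul(X, W):
    return [[sum(X[i][p] * W[p][j] for p in range(len(W)))
             for j in range(len(W[0]))] for i in range(len(X))]

W = [[1, 0], [0, 1], [2, 1], [1, 2]]        # full weight (4 x 2)
X = [[1, 2, 3, 4]]                          # full input (1 x 4)

n = 2
rows = len(W) // n
W_shards = [W[d * rows:(d + 1) * rows] for d in range(n)]           # row slices
X_shards = [[row[d * rows:(d + 1) * rows] for row in X] for d in range(n)]

partials = [matmul(X_shards[d], W_shards[d]) for d in range(n)]

# AllReduce: elementwise sum of the partial results across devices.
Y = [[sum(p[i][j] for p in partials) for j in range(len(partials[0][0]))]
     for i in range(len(X))]
assert Y == matmul(X, W)
```

<p>Note the input partitioning here is exactly the output partitioning a column-parallel layer produces, which is why the two compose so cheaply.</p>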
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/RowParallelLinear_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="tensor-parallelism-in-the-mlp-block" class="level1">
<h1>Tensor Parallelism in the MLP Block</h1>
<audio controls="" class="section-audio">
<source src="audio/mlp_tensor_parallel.wav" type="audio/wav">
</audio>
<p>In a Transformer’s <strong>MLP (Feed-Forward)</strong> block, there are typically two linear layers with a non-linearity (e.g., GeLU) in between. Tensor parallelism applies column parallelism to the first linear layer and row parallelism to the second. The first layer splits its output features across devices, and the GeLU activation is applied locally on each device. The second layer then takes these partitioned activations as input and performs a row-parallel computation, finishing with an AllReduce to synchronize the output. This design requires only <strong>one AllReduce per MLP block</strong> in the forward pass, keeping communication overhead minimal.</p>
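<p>Putting the two layer types together gives the tensor-parallel MLP. The sketch below uses tiny, made-up weights; the key property it demonstrates is that GeLU is elementwise, so it can be applied locally to each device's slice, and only one AllReduce (the final sum) is needed:</p>

```python
import math

# Toy sketch of a tensor-parallel MLP forward pass (illustrative shapes):
# column-parallel first linear, local GeLU, row-parallel second linear,
# and a single AllReduce (sum) at the end.

def matmul(X, W):
    return [[sum(X[i][p] * W[p][j] for p in range(len(W)))
             for j in range(len(W[0]))] for i in range(len(X))]

def gelu(M):
    return [[0.5 * v * (1 + math.erf(v / math.sqrt(2))) for v in row]
            for row in M]

X  = [[1.0, -1.0]]
W1 = [[0.5, 1.0, -1.0, 2.0], [1.0, 0.5, 2.0, -1.0]]    # (2 x 4), column-split
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # (4 x 2), row-split

n = 2
W1_shards = [[row[d * 2:(d + 1) * 2] for row in W1] for d in range(n)]
W2_shards = [W2[d * 2:(d + 1) * 2] for d in range(n)]

# Each device: column-parallel matmul, local GeLU, row-parallel matmul.
partials = [matmul(gelu(matmul(X, W1_shards[d])), W2_shards[d])
            for d in range(n)]

# The only AllReduce of the block: sum the partial outputs.
Y = [[sum(p[i][j] for p in partials) for j in range(2)]
     for i in range(len(X))]

# Reference: the same MLP computed without any sharding.
Y_ref = matmul(gelu(matmul(X, W1)), W2)
assert all(abs(Y[0][j] - Y_ref[0][j]) < 1e-9 for j in range(2))
```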
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/MLPTensorParallel_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="tensor-parallelism-in-the-attention-block" class="level1">
<h1>Tensor Parallelism in the Attention Block</h1>
<audio controls="" class="section-audio">
<source src="audio/attention_tensor_parallel.wav" type="audio/wav">
</audio>
<p>The multi-head attention mechanism is naturally suited for tensor parallelism because attention heads are <strong>independent</strong> computations. Each device is assigned a subset of the attention heads. The Query, Key, and Value projection matrices are split column-wise so that each device computes projections for its assigned heads. After computing attention independently, the output projection is applied as a row-parallel linear layer, with an AllReduce to combine the results. Just like the MLP block, this requires only <strong>one AllReduce per attention block</strong> in the forward pass.</p>
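<p>A toy illustration of head independence follows. The per-head Q, K, and V arrays below are assumed to be already-projected activations (in a real model each device would produce them from column-split projection matrices); each "device" runs its head with no communication, and the output projection would then be row-parallel as in the MLP block:</p>

```python
import math

# Toy sketch of head-parallel attention: 2 heads on 2 "devices", each head
# computed independently. Shapes and values are illustrative only.

def matmul(A, B):
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d = len(Q[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])   # Q K^T
    probs = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(probs, V)

# Per-head activations, indexed as [head][position][dim].
Q = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
K = [[[1.0, 1.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 1.0]]]
V = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]

# Each device computes attention for its head with no communication at all.
head_outputs = [attention(Q[h], K[h], V[h]) for h in range(2)]

# Concatenating head outputs reproduces standard multi-head attention input
# to the (row-parallel) output projection.
combined = [head_outputs[0][t] + head_outputs[1][t] for t in range(2)]
assert len(combined) == 2 and len(combined[0]) == 4
```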
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/AttentionTensorParallel_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="communication-the-allreduce-cost" class="level1">
<h1>Communication: The AllReduce Cost</h1>
<audio controls="" class="section-audio">
<source src="audio/allreduce_cost.wav" type="audio/wav">
</audio>
<p>Tensor parallelism relies on <strong>AllReduce</strong> operations to synchronize partial results across devices. In each transformer layer, there are two AllReduce operations in the forward pass (one for the attention block and one for the MLP block) and two in the backward pass. Unlike pipeline parallelism, which only communicates between adjacent stages, tensor parallelism requires <strong>collective communication among all participating devices</strong> within every layer. This makes tensor parallelism most effective when devices are connected via high-bandwidth interconnects (e.g., NVLink within a single node), as communication latency directly impacts throughput.</p>
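<p>A rough back-of-the-envelope estimate shows why bandwidth matters so much. The dimensions below are assumptions for illustration, not figures from this post; each AllReduce moves roughly one activation tensor of size batch × sequence × hidden:</p>

```python
# Back-of-the-envelope AllReduce volume per transformer layer.
# All dimensions here are assumed example values.

batch, seq, hidden = 8, 4096, 8192     # illustrative training configuration
bytes_per_elem = 2                     # bf16 activations
tensor_bytes = batch * seq * hidden * bytes_per_elem
allreduces_per_layer = 4               # attention + MLP, forward + backward

gb = allreduces_per_layer * tensor_bytes / 1e9
print(f"~{gb:.1f} GB of AllReduce traffic per layer per step")
```

<p>At this (assumed) scale, every transformer layer moves on the order of gigabytes per step, which is tolerable over NVLink but quickly dominates over slower inter-node links.</p>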
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/AllReduceVisualization_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="tensor-vs-pipeline-vs-data-parallelism" class="level1">
<h1>Tensor vs Pipeline vs Data Parallelism</h1>
<audio controls="" class="section-audio">
<source src="audio/tensor_vs_pipeline_vs_data.wav" type="audio/wav">
</audio>
<p>Each parallelism strategy operates at a different granularity and serves a different purpose. In practice, state-of-the-art training systems like Megatron-LM combine all three in a <strong>3D parallelism</strong> configuration: tensor parallelism within a node (leveraging fast NVLink), pipeline parallelism across nodes, and data parallelism across pipeline replicas.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 19%">
<col style="width: 25%">
<col style="width: 26%">
<col style="width: 27%">
</colgroup>
<thead>
<tr class="header">
<th>Feature</th>
<th>Data Parallelism</th>
<th>Pipeline Parallelism</th>
<th>Tensor Parallelism</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>What is split</td>
<td>Data (batches)</td>
<td>Model (layers)</td>
<td>Model (tensors within layers)</td>
</tr>
<tr class="even">
<td>Granularity</td>
<td>Coarse</td>
<td>Medium</td>
<td>Fine</td>
</tr>
<tr class="odd">
<td>Model per device</td>
<td>Full copy (or sharded/FSDP)</td>
<td>Subset of layers</td>
<td>Subset of each layer</td>
</tr>
<tr class="even">
<td>Communication</td>
<td>AllReduce (gradients)</td>
<td>Activations between stages</td>
<td>AllReduce (activations)</td>
</tr>
<tr class="odd">
<td>Best interconnect</td>
<td>Any</td>
<td>Moderate bandwidth</td>
<td>High bandwidth (NVLink)</td>
</tr>
<tr class="even">
<td>Idle time</td>
<td>Minimal</td>
<td>Pipeline bubble</td>
<td>Minimal</td>
</tr>
<tr class="odd">
<td>Primary benefit</td>
<td>Faster training</td>
<td>Enables larger models</td>
<td>Parallelizes single layers</td>
</tr>
<tr class="even">
<td>Typical scope</td>
<td>Across nodes</td>
<td>Across nodes</td>
<td>Within a node</td>
</tr>
</tbody>
</table>
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/TensorVsPipelineVsData_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="summary" class="level1">
<h1>Summary</h1>
<audio controls="" class="section-audio">
<source src="audio/summary.wav" type="audio/wav">
</audio>
<p>Tensor Parallelism splits individual weight matrices across devices, enabling parallelism at the finest granularity. Column parallel layers split output features, row parallel layers split input features, and the two are paired together within Transformer MLP and attention blocks to minimize communication. Each transformer block requires only two AllReduce operations in the forward pass. Because of its high communication requirements, tensor parallelism works best within a single node with fast interconnects. Combined with pipeline and data parallelism in a 3D parallelism setup, it is essential for training today’s largest language models.</p>
<p><img src="https://rohanrajput04.github.io/posts/tensor-parallelism/assets/TensorParallelismSummary_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="references" class="level1">
<h1>References</h1>
<ul>
<li>Shoeybi, M., et al.&nbsp;(2019). <a href="https://arxiv.org/abs/1909.08053">Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism</a>. <em>arXiv:1909.08053</em></li>
<li>Narayanan, D., et al.&nbsp;(2021). <a href="https://arxiv.org/abs/2104.04473">Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM</a>. <em>arXiv:2104.04473</em></li>
<li>Korthikanti, V., et al.&nbsp;(2022). <a href="https://arxiv.org/abs/2205.05198">Reducing Activation Recomputation in Large Transformer Models</a>. <em>arXiv:2205.05198</em></li>
<li><a href="https://github.com/ManimCommunity/manim">Manim Community Edition</a> — Animation engine used for the visuals in this post</li>
<li><a href="https://github.com/hexgrad/kokoro">Kokoro TTS</a> — Text-to-speech model used to generate the audio narration</li>
</ul>


</section>

 ]]></description>
  <guid>https://rohanrajput04.github.io/posts/tensor-parallelism/</guid>
  <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Parallelism in AI - Part 2: Pipeline Parallelism</title>
  <dc:creator>Rohan Rajput</dc:creator>
  <link>https://rohanrajput04.github.io/posts/pipeline-parallelism/</link>
  <description><![CDATA[ 





<section id="what-is-pipeline-parallelism" class="level1">
<h1>What is Pipeline Parallelism?</h1>
<audio controls="" class="section-audio">
<source src="audio/what_is_pipeline_parallelism.wav" type="audio/wav">
</audio>
<p>In <a href="../../posts/data-parallelism/index.html">Part 1</a>, we saw how Data Parallelism replicates the entire model across devices and splits the data. But what happens when a model is too large to fit on a single device, even with FSDP? Pipeline Parallelism takes a different approach. Instead of replicating the model, it partitions the model itself across multiple devices. Each device holds a subset of the model’s layers (called a <strong>stage</strong>), and data flows through the stages sequentially, much like an assembly line in a factory.</p>
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/PipelineParallelismIntro_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="how-pipeline-parallelism-works" class="level1">
<h1>How Pipeline Parallelism Works</h1>
<audio controls="" class="section-audio">
<source src="audio/how_pipeline_parallelism_works.wav" type="audio/wav">
</audio>
<p>In pipeline parallelism, the model is split into consecutive groups of layers, and each group is assigned to a different device. During the forward pass, each device processes its layers and sends the activations to the next device. During the backward pass, gradients flow in the reverse direction. This allows us to train models that are too large for a single device’s memory, since each device only needs to store a fraction of the total parameters.</p>
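<p>The partitioning idea can be sketched in a few lines of plain Python. The "layers" below are hypothetical toy functions; a real implementation would additionally transfer activations between devices:</p>

```python
# Minimal sketch of partitioning a model's layers into pipeline stages and
# running a sequential forward pass. The "layers" are toy stand-ins.

layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]

def make_stages(layers, n_stages):
    """Split consecutive layers into n_stages equal groups."""
    per = len(layers) // n_stages
    return [layers[i * per:(i + 1) * per] for i in range(n_stages)]

def forward(stages, x):
    for stage in stages:        # in practice: each stage on its own device
        for layer in stage:
            x = layer(x)        # activations flow on to the next stage
    return x

stages = make_stages(layers, 2)
assert forward(stages, 3) == ((3 + 1) * 2 - 3) ** 2
```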
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/PipelineParallelismDetailed_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="naive-pipeline-parallelism" class="level1">
<h1>Naive Pipeline Parallelism</h1>
<audio controls="" class="section-audio">
<source src="audio/naive_pipeline_parallelism.wav" type="audio/wav">
</audio>
<p>The simplest form of pipeline parallelism processes one mini-batch at a time through all stages sequentially. While straightforward, this approach has a major drawback: at any given time, only one device is actively computing while all others sit idle. This means device utilization is roughly 1/N, where N is the number of stages. The idle time wasted across devices is known as the <strong>pipeline bubble</strong>.</p>
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/NaivePipelineParallelism_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="the-bubble-problem" class="level1">
<h1>The Bubble Problem</h1>
<audio controls="" class="section-audio">
<source src="audio/bubble_problem.wav" type="audio/wav">
</audio>
<p>The pipeline bubble is the key inefficiency of naive pipeline parallelism. If we have 4 stages and it takes time <em>t</em> for each stage to process a mini-batch, then during the forward pass, stage 1 finishes at <em>t</em> but stage 4 doesn’t start until <em>3t</em>. The total idle time across all devices grows linearly with the number of stages. Reducing this bubble is the primary goal of more advanced pipeline scheduling strategies.</p>
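<p>The idle-time arithmetic from the example above can be written out directly: with N stages each taking time <em>t</em> per mini-batch, the forward pass lasts N·t, but each device computes for only t:</p>

```python
# Idle-time arithmetic for the naive pipeline schedule.

def naive_forward_stats(n_stages, t=1.0):
    total_time = n_stages * t            # mini-batch traverses all stages
    busy_per_device = t                  # each device works exactly once
    idle_per_device = total_time - busy_per_device
    utilization = busy_per_device / total_time
    return idle_per_device, utilization

idle, util = naive_forward_stats(4)
assert idle == 3.0 and util == 0.25      # stage 4 starts only at 3t
```

<p>Utilization is 1/N, so adding stages to a naive pipeline makes the waste strictly worse.</p>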
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/PipelineBubbleAnalysis_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="micro-batching-gpipe" class="level1">
<h1>Micro-batching (GPipe)</h1>
<audio controls="" class="section-audio">
<source src="audio/gpipe_microbatching.wav" type="audio/wav">
</audio>
<p>GPipe addresses the bubble problem by splitting each mini-batch into smaller <strong>micro-batches</strong>. Instead of waiting for an entire mini-batch to pass through all stages, GPipe injects multiple micro-batches into the pipeline in quick succession. This way, while stage 2 is processing micro-batch 1, stage 1 can already start on micro-batch 2. The more micro-batches we use, the smaller the bubble becomes relative to the total computation. Gradients are accumulated across all micro-batches and synchronized at the end.</p>
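<p>The intuition can be quantified with the standard GPipe bubble estimate: with p stages and m micro-batches, the fraction of time lost to the bubble is roughly (p − 1) / (m + p − 1):</p>

```python
# Standard GPipe bubble-fraction estimate: (p - 1) / (m + p - 1),
# where p = pipeline stages and m = micro-batches per mini-batch.

def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

assert bubble_fraction(4, 1) == 0.75      # naive case: 1 micro-batch, 3/4 idle
assert bubble_fraction(4, 8) == 3 / 11    # 8 micro-batches: ~27% bubble
assert bubble_fraction(4, 32) < 0.09      # bubble keeps shrinking with m
```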
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/GPipeSchedule_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="f1b-schedule-pipedream" class="level1">
<h1>1F1B Schedule (PipeDream)</h1>
<audio controls="" class="section-audio">
<source src="audio/one_f_one_b_schedule.wav" type="audio/wav">
</audio>
<p>The <strong>1F1B (One Forward, One Backward)</strong> schedule, introduced by PipeDream, further improves pipeline efficiency. After an initial warm-up phase where forward passes fill the pipeline, each device alternates between one forward pass and one backward pass. This interleaved scheduling reduces peak memory usage compared to GPipe, since devices don’t need to store activations for all micro-batches simultaneously, while maintaining similar pipeline utilization.</p>
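<p>The memory benefit can be summarized with a deliberate simplification: GPipe holds activations for all m micro-batches until the backward phase begins, while 1F1B caps in-flight micro-batches per stage at roughly the number of stages p, independent of m:</p>

```python
# Simplified peak-activation comparison between GPipe and 1F1B.
# This ignores per-stage variation and is meant only to show the scaling.

def peak_activations(schedule, p, m):
    if schedule == "gpipe":
        return m            # all micro-batches' activations live at once
    if schedule == "1f1b":
        return min(p, m)    # warm-up fills at most p forwards per stage
    raise ValueError(schedule)

assert peak_activations("gpipe", p=4, m=16) == 16
assert peak_activations("1f1b", p=4, m=16) == 4   # no longer grows with m
```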
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/OneFOneBSchedule_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="pipeline-parallelism-vs-data-parallelism" class="level1">
<h1>Pipeline Parallelism vs Data Parallelism</h1>
<audio controls="" class="section-audio">
<source src="audio/pipeline_vs_data_parallelism.wav" type="audio/wav">
</audio>
<p>Pipeline parallelism and data parallelism solve different problems and are often used together. Data parallelism replicates the model and splits the data, which is ideal when the model fits on a single device but you want faster training. Pipeline parallelism splits the model across devices, which is necessary when the model is too large for one device. In practice, large-scale training combines both: the model is partitioned across pipeline stages, and each stage is replicated across multiple devices using data parallelism.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 27%">
<col style="width: 35%">
<col style="width: 37%">
</colgroup>
<thead>
<tr class="header">
<th>Feature</th>
<th>Data Parallelism</th>
<th>Pipeline Parallelism</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>What is split</td>
<td>Data (batches)</td>
<td>Model (layers)</td>
</tr>
<tr class="even">
<td>Model per device</td>
<td>Full copy</td>
<td>Subset of layers</td>
</tr>
<tr class="odd">
<td>Primary benefit</td>
<td>Faster training</td>
<td>Enables larger models</td>
</tr>
<tr class="even">
<td>Communication</td>
<td>AllReduce (gradients)</td>
<td>Activations between stages</td>
</tr>
<tr class="odd">
<td>Idle time</td>
<td>Minimal</td>
<td>Pipeline bubble</td>
</tr>
<tr class="even">
<td>Memory per device</td>
<td>Full model + optimizer</td>
<td>Partial model + optimizer</td>
</tr>
<tr class="odd">
<td>Scalability</td>
<td>Limited by model size</td>
<td>Limited by number of layers</td>
</tr>
</tbody>
</table>
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/PipelineVsDataParallelism_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="summary" class="level1">
<h1>Summary</h1>
<audio controls="" class="section-audio">
<source src="audio/summary.wav" type="audio/wav">
</audio>
<p>Pipeline Parallelism enables training of models too large for a single device by partitioning layers across multiple devices. The naive approach suffers from the pipeline bubble, where devices sit idle waiting for data to flow through the pipeline. GPipe reduces this bubble through micro-batching, and PipeDream’s 1F1B schedule further optimizes memory usage with interleaved forward and backward passes. Combined with data parallelism, pipeline parallelism is a core building block for training today’s largest AI models.</p>
<p><img src="https://rohanrajput04.github.io/posts/pipeline-parallelism/assets/PipelineParallelismSummary_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="references" class="level1">
<h1>References</h1>
<ul>
<li>Huang, Y., et al.&nbsp;(2019). <a href="https://arxiv.org/abs/1811.06965">GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism</a>. <em>NeurIPS 2019</em></li>
<li>Narayanan, D., et al.&nbsp;(2019). <a href="https://arxiv.org/abs/1806.03377">PipeDream: Generalized Pipeline Parallelism for DNN Training</a>. <em>SOSP 2019</em></li>
<li>Narayanan, D., et al.&nbsp;(2021). <a href="https://arxiv.org/abs/2104.04473">Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM</a>. <em>arXiv:2104.04473</em></li>
<li><a href="https://github.com/ManimCommunity/manim">Manim Community Edition</a> — Animation engine used for the visuals in this post</li>
</ul>


</section>

 ]]></description>
  <guid>https://rohanrajput04.github.io/posts/pipeline-parallelism/</guid>
  <pubDate>Sun, 08 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Parallelism in AI - Part 1: Data Parallelism &amp; FSDP</title>
  <dc:creator>Rohan Singh Rajput</dc:creator>
  <link>https://rohanrajput04.github.io/posts/data-parallelism/</link>
  <description><![CDATA[ 





<section id="why-do-we-need-parallelism-in-ai" class="level1">
<h1>Why do we need parallelism in AI?</h1>
<audio controls="" class="section-audio">
<source src="audio/what_is_parallelism.wav" type="audio/wav">
</audio>
<p>One of the core problems with modern neural-network-based AI is that it requires a massive amount of computation during training and inference. To perform these floating-point operations (FLOPs), we rely on specialized hardware such as GPUs, TPUs, and NPUs, which can apply a single instruction to many data elements simultaneously. However, the compute and memory limits of a single device make training large AI models infeasible on its own. Hence, we leverage multiple devices to speed up the training process and to handle large models.</p>
</section>
<section id="introduction-to-data-parallelism" class="level1">
<h1>Introduction to Data Parallelism</h1>
<audio controls="" class="section-audio">
<source src="audio/intro_data_parallelism.wav" type="audio/wav">
</audio>
<p>Many parallelism techniques exist for training and inference of AI models. In this section, we focus on Data Parallelism, a technique that is specifically useful during model training. As the name suggests, we shard (or divide) our data into smaller batches, and each batch is processed in parallel on a different device. Compared to single-device training, where batches are processed one at a time, this lets us use multiple devices to speed up model training.</p>
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/DataParallelismIntro_ManimCE_v0.19.2.gif" class="img-fluid"></p>
<audio controls="" class="section-audio">
<source src="audio/data_parallelism_detail.wav" type="audio/wav">
</audio>
<p>In data parallelism, each device maintains a copy of the model parameters. During training, each device processes a different subset of the training data and computes the gradients independently. After computing the gradients, an AllReduce operation is performed to aggregate the gradients across all devices. This ensures that all devices have the same updated model parameters for the next iteration.</p>
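<p>One data-parallel step can be simulated in a few lines. The model here is a deliberately trivial one-parameter least-squares fit (a made-up example), but the synchronization pattern is the real one: local gradients, an averaging AllReduce, then an identical update everywhere:</p>

```python
# Minimal simulation of one data-parallel training step across two "devices".

def grad(w, x, y):
    """Gradient of the squared error 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

w = [1.0, 1.0]                        # identical parameter copy per device
shards = [(2.0, 5.0), (3.0, 3.0)]     # each device sees different (x, y) data

grads = [grad(w[d], *shards[d]) for d in range(2)]   # local gradients
g = sum(grads) / len(grads)                          # AllReduce (average)

lr = 0.1
w = [wd - lr * g for wd in w]                        # same update on each device
assert w[0] == w[1]                                  # replicas stay in sync
```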
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/DataParallelismDetailed_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="what-is-allreduce" class="level1">
<h1>What is AllReduce?</h1>
<audio controls="" class="section-audio">
<source src="audio/allreduce_explained.wav" type="audio/wav">
</audio>
<p>We can think of AllReduce as a communication operation that takes the gradients computed by each device and combines them (e.g., by summing) across all devices. This allows each device to have the same updated gradients, which are then used to update the model parameters. AllReduce is a critical component of data parallelism, as it ensures that all devices stay in sync during training.</p>
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/AllReduceExplained_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="lets-talk-about-fsdp" class="level1">
<h1>Let’s talk about FSDP</h1>
<audio controls="" class="section-audio">
<source src="audio/what_is_fsdp.wav" type="audio/wav">
</audio>
<p>Fully Sharded Data Parallel (FSDP) is an advanced parallelism strategy that goes beyond traditional data parallelism. In FSDP, the model parameters are sharded (i.e., divided) across multiple devices, rather than each device maintaining a full copy of the model. This allows for training larger models that may not fit into the memory of a single device. FSDP also incorporates techniques to efficiently manage communication and synchronization between devices, making it a powerful tool for training large-scale AI models.</p>
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/FSDPExplained_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="fsdp-workflow" class="level1">
<h1>FSDP Workflow</h1>
<audio controls="" class="section-audio">
<source src="audio/fsdp_workflow.wav" type="audio/wav">
</audio>
<p>In the FSDP workflow, the model parameters are sharded across multiple devices. Before a layer’s forward (and backward) computation, each device temporarily <strong>AllGathers</strong> the full parameters for that layer, computes with them, and then frees the gathered copies. After the backward pass, gradients are combined with a <strong>ReduceScatter</strong>, so each device retains only the gradient shard corresponding to the parameters it owns. This allows efficient training of large models while keeping per-device memory bounded.</p>
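<p>A conceptual, plain-Python sketch of FSDP's two core collectives, AllGather (reconstruct full parameters before compute) and ReduceScatter (leave each device only its gradient shard). Shapes and values are toy:</p>

```python
# Conceptual sketch of FSDP's AllGather / ReduceScatter flow.

n = 2
shards = [[1.0, 2.0], [3.0, 4.0]]             # device d owns shards[d]

def all_gather(shards):
    """Every device reconstructs the full flat parameter vector."""
    return [p for s in shards for p in s]

def reduce_scatter(full_grads, n):
    """Sum per-device gradients, then give each device only its slice."""
    summed = [sum(g[i] for g in full_grads)
              for i in range(len(full_grads[0]))]
    per = len(summed) // n
    return [summed[d * per:(d + 1) * per] for d in range(n)]

params = all_gather(shards)                    # before compute: full params
full_grads = [[0.5] * 4, [1.5] * 4]            # each device's full gradient
grad_shards = reduce_scatter(full_grads, n)    # after backward: shard only

assert params == [1.0, 2.0, 3.0, 4.0]
assert grad_shards == [[2.0, 2.0], [2.0, 2.0]]
```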
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/FSDPWorkflow_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="fsdp-vs-data-parallel" class="level1">
<h1>FSDP vs Data Parallel</h1>
<audio controls="" class="section-audio">
<source src="audio/fsdp_vs_data_parallel.wav" type="audio/wav">
</audio>
<p>The difference between FSDP and traditional data parallelism lies in how the model parameters are managed. In data parallelism, each device maintains a full copy of the model parameters, which can lead to memory constraints when training large models. In contrast, FSDP shards the model parameters across multiple devices, allowing for larger models to be trained without running into memory issues. Additionally, FSDP incorporates more efficient communication strategies to manage synchronization between devices, making it a more scalable solution for training large-scale AI models.</p>
<section id="example" class="level2">
<h2 class="anchored" data-anchor-id="example">Example</h2>
<p>Suppose we have a 7-billion-parameter model that we want to train on a 4-GPU node. In DP, we place a full copy of the 7B model on each GPU. In FSDP, each device holds only one quarter of the model parameters.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 26%">
<col style="width: 35%">
<col style="width: 37%">
</colgroup>
<thead>
<tr class="header">
<th>Feature</th>
<th>Data Parallelism (DP)</th>
<th>FSDP</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Model storage</td>
<td>Full copy on every GPU</td>
<td>Sharded across GPUs</td>
</tr>
<tr class="even">
<td>Params per GPU</td>
<td>7B (all)</td>
<td>1.75B (1/4th)</td>
</tr>
<tr class="odd">
<td>Memory per GPU</td>
<td>~42 GB</td>
<td>~10.5 GB</td>
</tr>
<tr class="even">
<td>Fits on 40GB A100?</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr class="odd">
<td>Communication</td>
<td>AllReduce (gradients only)</td>
<td>AllGather + ReduceScatter</td>
</tr>
<tr class="even">
<td>Communication cost</td>
<td>Lower</td>
<td>Higher (overlapped with compute)</td>
</tr>
<tr class="odd">
<td>Complexity</td>
<td>Simple</td>
<td>More complex</td>
</tr>
<tr class="even">
<td>Scalability</td>
<td>Limited by GPU memory</td>
<td>Scales to much larger models</td>
</tr>
</tbody>
</table>
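<p>The memory figures in the table follow from a simple assumption of roughly 6 bytes of state per parameter (the exact breakdown depends on the precision recipe and optimizer; 6 bytes is the value that reproduces the ~42 GB figure above, not a universal constant):</p>

```python
# Arithmetic behind the table's memory figures.
# bytes_per_param = 6 is an assumption chosen to match the ~42 GB number;
# real totals depend on precision and optimizer choices.

params = 7e9
bytes_per_param = 6
n_gpus = 4

dp_per_gpu = params * bytes_per_param / 1e9    # full replica per GPU (GB)
fsdp_per_gpu = dp_per_gpu / n_gpus             # 1/4 of the states (GB)

assert dp_per_gpu == 42.0      # does not fit on a 40 GB A100
assert fsdp_per_gpu == 10.5    # fits comfortably
```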
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/FSDPvsDataParallel_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
</section>
<section id="fsdp-summary" class="level1">
<h1>FSDP Summary</h1>
<audio controls="" class="section-audio">
<source src="audio/fsdp_summary.wav" type="audio/wav">
</audio>
<p>In summary, Fully Sharded Data Parallel (FSDP) enables training of models too large for traditional data parallelism by sharding parameters across devices. Its extra communication (AllGather and ReduceScatter instead of a single AllReduce) is largely overlapped with computation, so the memory savings come at a modest throughput cost. For scaling model size on a fixed set of GPUs, FSDP is one of the most effective tools available.</p>
<p><img src="https://rohanrajput04.github.io/posts/data-parallelism/assets/FSDPSummary_ManimCE_v0.19.2.gif" class="img-fluid"></p>
</section>
<section id="references" class="level1">
<h1>References</h1>
<ul>
<li>Li, M., et al.&nbsp;(2020). <a href="https://arxiv.org/abs/2006.15704">PyTorch Distributed: Experiences on Accelerating Data Parallel Training</a>. <em>arXiv:2006.15704</em></li>
<li>Zhao, Y., et al.&nbsp;(2023). <a href="https://arxiv.org/abs/2304.11277">PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel</a>. <em>arXiv:2304.11277</em></li>
<li><a href="https://github.com/ManimCommunity/manim">Manim Community Edition</a> — Animation engine used for the visuals in this post</li>
<li><a href="https://github.com/hexgrad/kokoro">Kokoro TTS</a> — Text-to-speech model used to generate the audio narration</li>
</ul>


</section>

 ]]></description>
  <guid>https://rohanrajput04.github.io/posts/data-parallelism/</guid>
  <pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Welcome to AI With Rohan</title>
  <dc:creator>Rohan Rajput</dc:creator>
  <link>https://rohanrajput04.github.io/posts/welcome-post/</link>
  <description><![CDATA[ 





<section id="welcome" class="level1">
<h1>Welcome!</h1>
<p>This is my first blog post on this new site built with Quarto.</p>
<section id="what-to-expect" class="level2">
<h2 class="anchored" data-anchor-id="what-to-expect">What to expect</h2>
<p>I’ll be sharing:</p>
<ul>
<li>Programming tutorials</li>
<li>Technology insights</li>
<li>Personal projects</li>
<li>Thoughts on software development</li>
</ul>
<p>Stay tuned for more content!</p>


</section>
</section>

 ]]></description>
  <category>general</category>
  <guid>https://rohanrajput04.github.io/posts/welcome-post/</guid>
  <pubDate>Fri, 13 Sep 2024 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
