commit a9192dd7da (parent a50fee0dcf)
Author: eric
Date: 2025-08-03 06:15:25 +00:00

14 changed files with 37 additions and 37 deletions


@@ -19,8 +19,8 @@ where δ_t = r_t + γV(s_{t+1}) - V(s_t)
### Avoiding Amnesia: The Pretraining Loss

There's one final problem. If we only optimize for the PPO loss, the model might learn to "hack" the reward model by generating repetitive or nonsensical text that gets a high score. In doing so, it could suffer from **catastrophic forgetting**, losing its fundamental grasp of grammar and facts.

To prevent this, we introduce a second loss term. As seen in the diagram, we mix in data from the original **Pretraining Data** (or the dataset used for Supervised Fine-Tuning). We calculate a standard next-token prediction loss (`LM Loss`) on this high-quality data.

The final loss for the Actor is a combination of both objectives:

**Total Loss = Loss_PPO + `λ_ptx` * Loss_LM**

This brilliantly balances two goals:

1. The `Loss_PPO` pushes the model towards behaviors that align with human preferences.
2. The `Loss_LM` acts as a regularizer, pulling the model back towards its core language capabilities and preventing it from drifting into gibberish.
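To make the combination concrete, here is a minimal PyTorch-style sketch of the Actor's combined objective. It is illustrative rather than the post's actual code: the `lambda_ptx` value, the `model.logprobs` helper, the HuggingFace-style `.logits` output, and the batch field names are all assumptions.

```python
import torch
import torch.nn.functional as F

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # PPO-clip surrogate: probability ratio of new to old policy per token,
    # clipped so a single update cannot move the policy too far from pi_k.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def total_actor_loss(model, rl_batch, ptx_batch, lambda_ptx=0.1):
    # PPO term on rollouts collected from the current policy.
    new_logprobs = model.logprobs(rl_batch["input_ids"], rl_batch["actions"])  # assumed helper
    loss_ppo = ppo_clip_loss(new_logprobs, rl_batch["old_logprobs"], rl_batch["advantages"])

    # LM term: ordinary next-token cross-entropy on pretraining / SFT data,
    # acting as the regularizer that prevents catastrophic forgetting.
    logits = model(ptx_batch["input_ids"]).logits  # assumes a HF-style causal LM
    loss_lm = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        ptx_batch["input_ids"][:, 1:].reshape(-1),
    )

    # Total Loss = Loss_PPO + lambda_ptx * Loss_LM
    return loss_ppo + lambda_ptx * loss_lm
```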
### The Full Training Loop
Now, we can assemble the entire process into a clear, iterative loop:

1. **Collect**: The current Actor policy `π_k` generates responses to a batch of prompts. These experiences—`(state, action, probability, reward, value)`—are stored in an **Experience Buffer**.
2. **Calculate**: Once the buffer is full, we use the collected data to compute the advantage estimates `Â_t` for every single token-generation step (see the sketch after this list).
3. **Optimize**: For a few epochs, we repeatedly sample mini-batches from the buffer and update the Actor and Critic models. The Actor is updated using the combined `PPO-clip Loss` and `LM Loss`. The Critic is updated to improve its value predictions.
4. **Flush and Repeat**: After the optimization phase, the entire experience buffer is discarded. The data is now "stale" because our policy has changed. The newly updated policy `π_{k+1}` becomes the new Actor, and we return to step 1 to collect fresh data.

This cycle of collection and optimization allows the language model to gradually and safely steer its behavior towards human-defined goals, creating the helpful and aligned AI assistants we interact with today.
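The "Calculate" step typically uses Generalized Advantage Estimation (GAE) from Schulman et al. (2015) [2]. Below is a small, self-contained sketch of that computation; the function name, the default γ and λ values, and the assumption that only the final token carries a non-zero reward are illustrative, not taken from the post.

```python
from typing import List

def compute_gae(rewards: List[float], values: List[float],
                gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Return advantage estimates A_t for each token-generation step."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example: a 4-token response where only the last token receives the reward-model score.
print(compute_gae(rewards=[0.0, 0.0, 0.0, 1.2], values=[0.1, 0.2, 0.3, 0.5]))
```

The same per-token advantages feed both the PPO-clip loss for the Actor and the return targets (advantage plus value) used to update the Critic.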
---

**References:**

1. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). *Proximal Policy Optimization Algorithms*. arXiv preprint arXiv:1707.06347.
2. Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). *High-Dimensional Continuous Control Using Generalized Advantage Estimation*. arXiv preprint arXiv:1506.02438.
3. Ouyang, L., et al. (2022). *Training language models to follow instructions with human feedback*. Advances in Neural Information Processing Systems 35.