📚 Auto-publish: Add/update 6 blog posts
All checks were successful
Hugo Publish CI / build-and-deploy (push) Successful in 58s

Generated on: Thu Jan  8 18:13:13 UTC 2026
Source: md-personal repository
Automated Publisher
2026-01-08 18:13:13 +00:00
parent 3b1396d814
commit f7528b364e
6 changed files with 9 additions and 9 deletions


@@ -8,7 +8,7 @@ draft: false
Large Language Models (LLMs) have demonstrated astonishing capabilities, but out of the box they are simply powerful text predictors. They don't inherently understand what makes a response helpful, harmless, or aligned with human values. The technique that has proven most effective at bridging this gap is Reinforcement Learning from Human Feedback (RLHF), and at its heart lies a powerful algorithm: Proximal Policy Optimization (PPO).
You may have seen diagrams like the one below, which outlines the RLHF training process. It can look intimidating, with a web of interconnected models, losses, and data flows.
![S3 File](/images/ppo-for-language-models/7713bd3ecf27442e939b9190fa08165d.png)
![S3 File](http://localhost:4998/attachments/image-3632d923eed983f171fba4341825273101f1fc94.png?client=default&bucket=obsidian)
This post will decode that diagram, piece by piece. We'll explore the "why" behind each component, moving from high-level concepts to the deep technical reasoning that makes this process work.
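As a preview of the algorithm sitting at the center of that diagram, here is a minimal sketch of PPO's clipped surrogate objective. This is an illustrative example rather than code from the post; the tensor names and shapes (per-token log-probabilities and advantages for one sampled batch) are assumptions made for the sketch.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO.

    logp_new:   log-probs of sampled tokens under the current policy
    logp_old:   log-probs of the same tokens under the policy that sampled them
    advantages: per-token advantage estimates (e.g. from GAE)
    """
    ratio = torch.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic minimum, then negate so it can be minimized as a loss.
    return -torch.min(unclipped, clipped).mean()

# Illustrative usage with random tensors standing in for one batch of tokens.
logp_old = torch.randn(8)
logp_new = logp_old + 0.05 * torch.randn(8)
advantages = torch.randn(8)
loss = ppo_clipped_loss(logp_new, logp_old, advantages)
```

The clipping is what keeps each policy update close to the policy that generated the data, which is the property the rest of the RLHF pipeline relies on.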