Eric Liu 88cbb7efd5
(posts): add deep dive into PPO for language models post
This commit introduces a new blog post detailing the Proximal Policy Optimization (PPO) algorithm as used in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs).

The post covers:
- The mapping of RL concepts to text generation.
- The roles of the Actor, Critic, and Reward Model.
- The use of Generalized Advantage Estimation (GAE) for stable credit assignment.
- The PPO clipped surrogate objective for safe policy updates (both sketched below).
- The importance of pretraining loss to prevent catastrophic forgetting.
- The full iterative training loop.
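For a rough sense of the two core pieces the post covers, here is a minimal NumPy sketch of GAE and the clipped surrogate objective. It assumes a single completed trajectory with a bootstrap value appended; function names and the default hyperparameters (gamma=0.99, lam=0.95, clip_eps=0.2) are illustrative and not taken from the post itself.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: per-step rewards, length T
    values:  critic value estimates, length T + 1 (last entry is the bootstrap value)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD residual, then exponentially weighted sum backwards in time.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (a quantity to maximize)."""
    ratio = np.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum keeps updates pessimistic, so large
    # policy-ratio moves cannot be rewarded beyond the clip range.
    return np.mean(np.minimum(unclipped, clipped))
```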
2025-08-02 15:46:24 -07:00