There’s one final problem. If we only optimize for the PPO loss, the model might learn to “hack” the reward model by generating repetitive or nonsensical text that gets a high score. In doing so, it could suffer from catastrophic forgetting, losing its fundamental grasp of grammar and facts.
To prevent this, we introduce a second loss term. As seen in the diagram, we mix in data from the original Pretraining Data (or the dataset used for Supervised Fine-Tuning). We calculate a standard next-token prediction loss (LM Loss) on this high-quality data.
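To make this concrete, the extra term is just the familiar cross-entropy objective applied to batches of pretraining (or SFT) text. Below is a minimal PyTorch-style sketch; the function name and the assumption that the Actor’s forward pass yields logits of shape (batch, seq_len, vocab_size) are illustrative, not tied to any particular library.

```python
import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Standard next-token prediction loss on pretraining/SFT text.

    logits:    (batch, seq_len, vocab_size) produced by the Actor
    input_ids: (batch, seq_len) token ids of the same text
    """
    # Predict token t+1 from everything up to token t:
    # drop the last logit and the first label so the positions line up.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```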
The final loss for the Actor is a combination of both objectives:
Total Loss = Loss_PPO + λ_ptx * Loss_LM
This brilliantly balances two goals:
- The Loss_PPO pushes the model towards behaviors that align with human preferences.
- The Loss_LM acts as a regularizer, pulling the model back towards its core language capabilities and preventing it from drifting into gibberish.
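If it helps to see the two pieces side by side, here is a rough PyTorch sketch of the clipped surrogate (the standard PPO-clip form from Schulman et al., 2017) and the combined Actor loss. The clip range of 0.2 and the λ_ptx default of 0.1 are placeholder values, and the function names are my own rather than any library’s API.

```python
import torch

def ppo_clip_loss(new_logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective over per-token log-probabilities."""
    # Probability ratio r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t), computed in log space.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the pessimistic (minimum) surrogate; negate it to get a loss.
    return -torch.min(unclipped, clipped).mean()

def actor_total_loss(loss_ppo: torch.Tensor,
                     loss_lm: torch.Tensor,
                     lambda_ptx: float = 0.1) -> torch.Tensor:
    """Total Loss = Loss_PPO + lambda_ptx * Loss_LM (lambda_ptx is a tunable weight)."""
    return loss_ppo + lambda_ptx * loss_lm
```

Tuning λ_ptx trades the two goals off against each other: a larger value keeps the model closer to its original language-modeling behavior, while a smaller value lets the preference signal dominate.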
Now, we can assemble the entire process into a clear, iterative loop:
- Collect: The current Actor policy π_k generates responses to a batch of prompts. These experiences, each a tuple of (state, action, probability, reward, value), are stored in an Experience Buffer.
- Calculate: Once the buffer is full, we use the collected data to compute the advantage estimates Â_t for every single token-generation step.
- Optimize: For a few epochs, we repeatedly sample mini-batches from the buffer and update the Actor and Critic models. The Actor is updated using the combined PPO-clip Loss and LM Loss. The Critic is updated to improve its value predictions.
- Flush and Repeat: After the optimization phase, the entire experience buffer is discarded. The data is now “stale” because our policy has changed. The newly updated policy π_{k+1} becomes the new Actor, and we return to step 1 to collect fresh data. (A code-level sketch of this loop follows the list.)
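Putting the four steps together, the loop looks roughly like the sketch below, reusing lm_loss, ppo_clip_loss, and actor_total_loss from the earlier snippets. Every other component (actor, critic, reward_model, the sampling helpers, compute_gae, and the optimizers) is a hypothetical stand-in for whatever your training stack provides; only the order of operations mirrors the steps above.

```python
import torch

def rlhf_ppo_loop(actor, critic, reward_model,
                  actor_opt, critic_opt,
                  sample_prompts, sample_pretrain_batch, compute_gae,
                  num_iterations: int = 100, ppo_epochs: int = 4,
                  lambda_ptx: float = 0.1):
    """Collect -> Calculate -> Optimize -> Flush, repeated.

    All arguments are caller-supplied stand-ins; none of the method names
    below belong to a real library.
    """
    for _ in range(num_iterations):
        # 1. Collect: roll out the current policy pi_k and store experiences.
        buffer = []
        for prompt in sample_prompts():
            response, old_logprobs, values = actor.rollout(prompt, critic)
            reward = reward_model.score(prompt, response)
            buffer.append({"prompt": prompt, "response": response,
                           "old_logprobs": old_logprobs.detach(),
                           "values": values.detach(), "reward": reward})

        # 2. Calculate: per-token advantage estimates (e.g. GAE) and returns.
        for exp in buffer:
            exp["advantages"], exp["returns"] = compute_gae(exp["reward"],
                                                            exp["values"])

        # 3. Optimize: several epochs of updates on the same (now fixed) buffer.
        for _ in range(ppo_epochs):
            for exp in buffer:
                # Actor update: PPO-clip surrogate plus the pretraining-mix term.
                new_logprobs = actor.logprobs(exp["prompt"], exp["response"])
                loss_ppo = ppo_clip_loss(new_logprobs, exp["old_logprobs"],
                                         exp["advantages"])
                pretrain_ids = sample_pretrain_batch()
                # Assumes actor(ids) returns raw logits; adapt to your model's output type.
                loss_lm = lm_loss(actor(pretrain_ids), pretrain_ids)
                actor_opt.zero_grad()
                actor_total_loss(loss_ppo, loss_lm, lambda_ptx).backward()
                actor_opt.step()

                # Critic update: regress predicted values toward the returns.
                new_values = critic.values(exp["prompt"], exp["response"])
                value_loss = torch.mean((new_values - exp["returns"]) ** 2)
                critic_opt.zero_grad()
                value_loss.backward()
                critic_opt.step()

        # 4. Flush and repeat: the buffer is stale once the policy has moved on,
        #    so it is discarded and pi_{k+1} collects fresh data next iteration.
        del buffer
```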
This cycle of collection and optimization allows the language model to gradually and safely steer its behavior towards human-defined goals, creating the helpful and aligned AI assistants we interact with today.
References:
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
- Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv preprint arXiv:1506.02438.
- Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35.