📚 Auto-publish: Add/update 5 blog posts
All checks were successful
Hugo Publish CI / build-and-deploy (push) Successful in 39s

Generated on: Sun Aug  3 03:15:05 UTC 2025
Source: md-personal repository
This commit is contained in:
Automated Publisher
2025-08-03 03:15:05 +00:00
parent 38bbe8cbae
commit 84b3c2016e
5 changed files with 121 additions and 3 deletions

View File

@@ -0,0 +1 @@
Pasted image 20250730232756.png|.png

View File

@@ -1,6 +1,6 @@
---
title: "A Deep Dive into PPO for Language Models"
date: 2025-08-03T01:47:10
date: 2025-08-03T03:14:20
draft: false
---
@@ -9,7 +9,7 @@ Large Language Models (LLMs) have demonstrated astonishing capabilities, but out
You may have seen diagrams like the one below, which outlines the RLHF training process. It can look intimidating, with a web of interconnected models, losses, and data flows.
![[Pasted image 20250730232756.png]]
![](/images/a-deep-dive-into-ppo-for-language-models/.png)
This post will decode that diagram, piece by piece. We'll explore the "why" behind each component, moving from high-level concepts to the deep technical reasoning that makes this process work.

View File

@@ -0,0 +1,117 @@
---
title: "Mixture-of-Experts (MoE) Models Challenges & Solutions in Practice"
date: 2025-08-03T03:14:20
draft: false
---
Mixture-of-Experts (MoE) models are neural network architectures that allow different parts of the model (called "experts") to specialize in different types of inputs. A "gating network" or "router" learns to dispatch each input (or "token") to a subset of these experts. While powerful for scaling models, MoEs introduce several practical challenges.
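To make these terms concrete before diving into the challenges, here is a minimal, illustrative MoE layer in PyTorch. It is a sketch, not any particular paper's implementation; all names (`SimpleMoE`, `num_experts`, `k`) are made up for illustration. A linear router scores each token, the top-`k` experts are selected, and the token's output is the score-weighted sum of the selected experts' outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Illustrative sparse MoE layer: a linear router ("gating network") plus a pool of MLP experts."""

    def __init__(self, d_model: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)   # gating network (W_g)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)                   # (num_tokens, num_experts)
        topk_p, topk_idx = probs.topk(self.k, dim=-1)               # hard Top-K routing decision
        out = torch.zeros_like(x)
        # Dense loop over experts for readability; real systems dispatch tokens sparsely.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() > 0:
                weight = topk_p[token_ids, slot].unsqueeze(-1)       # gate value for expert e
                out[token_ids] += weight * expert(x[token_ids])
        return out

moe = SimpleMoE(d_model=64)
print(moe(torch.randn(16, 64)).shape)   # torch.Size([16, 64])
```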
### 1. Challenge: Non-Differentiability of Routing Functions
**The Problem:**
Many routing mechanisms, especially "Top-K routing," involve a discrete, hard selection process. A common function is `KeepTopK(v, k)`, which selects the top `k` scoring elements from a vector `v` and sets others to $-\infty$ or $0$.
$$
\operatorname{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v, \\ -\infty & \text{otherwise.} \end{cases}
$$
This function is **not differentiable**. Its gradient is zero almost everywhere and undefined at the threshold points, making it impossible to directly train the gating network's parameters (e.g., $W_g$) using standard gradient descent.
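To see this directly, here is a small, hedged demonstration (illustrative code, not from any reference implementation): build `KeepTopK` literally with a mask, push a stand-in loss through it, and check which router scores actually receive gradient.

```python
import torch

torch.manual_seed(0)
scores = torch.randn(8, requires_grad=True)        # router scores v for 8 experts
k = 2

# Literal KeepTopK: keep the top-k scores, set the rest to -inf.
_, topk_idx = scores.topk(k)
keep = torch.zeros(8, dtype=torch.bool)
keep[topk_idx] = True
masked = torch.where(keep, scores, torch.full_like(scores, float("-inf")))

gates = torch.softmax(masked, dim=-1)               # masked experts get weight exactly 0
loss = (gates * torch.arange(8.0)).sum()            # stand-in for a downstream loss
loss.backward()

print(scores.grad)
# Only the k selected positions receive any gradient; the other experts' scores
# get exactly zero, and the discrete choice of *which* experts land in the top-k
# contributes no gradient at all, so W_g cannot be trained through it directly.
```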
**Solutions (Stochastic Approximations):**
To enable end-to-end training, non-differentiable routing decisions must be approximated with differentiable or stochastic methods.
* **Stochastic Scoring (e.g., Shazeer et al. 2017):**
The expert score $H(x)_i = (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}((x \cdot W_{noise})_i)$ introduces Gaussian noise. This makes the scores themselves stochastic, which can be leveraged with other methods.
* **Gumbel-Softmax Trick (or Concrete Distribution):**
This method allows for differentiable sampling from categorical distributions. Instead of directly picking the top-k, Gumbel noise is added to the scores, and a Softmax (with a temperature parameter) is applied. This provides a continuous, differentiable approximation of a discrete choice, allowing gradients to flow back.
* **REINFORCE (Score Function Estimator):**
This is a policy gradient method from reinforcement learning. The routing decision is treated as an action, and the gating network's parameters are updated based on the "reward" (e.g., the model's performance). Gradients are estimated by sampling routing choices and weighting them by their outcomes.
* **Straight-Through Estimator (STE):**
A simpler approximation where, during the backward pass, gradients are treated as if the non-differentiable operation was an identity function or a simple smooth function.
* **Softmax after TopK (e.g., Mixtral, DBRX, DeepSeek v3):**
Instead of `Softmax(KeepTopK(...))`, some models apply a Softmax *only to the scores of the selected TopK experts*, and then assign $0$ to the rest. This provides differentiable weights for the selected experts while still enforcing sparsity (a minimal sketch follows this list).
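Here is a hedged sketch of that "Softmax after TopK" ordering (illustrative only; production routers such as Mixtral's differ in details): select the top-`k` logits first, softmax over just those `k` values, and leave every other expert at exactly zero.

```python
import torch
import torch.nn.functional as F

def softmax_after_topk(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Sparse gate weights: softmax over the top-k logits only, zeros elsewhere.

    logits: (num_tokens, num_experts) raw router scores.
    """
    topk_logits, topk_idx = logits.topk(k, dim=-1)        # hard selection of k experts per token
    topk_weights = F.softmax(topk_logits, dim=-1)         # differentiable weights for the chosen k
    return torch.zeros_like(logits).scatter(-1, topk_idx, topk_weights)

logits = torch.randn(4, 8, requires_grad=True)            # 4 tokens, 8 experts
gates = softmax_after_topk(logits, k=2)
print(gates)                                               # 2 nonzero entries per row, summing to 1

loss = (gates * torch.randn(8)).sum()                      # stand-in for a downstream loss
loss.backward()                                            # gradient reaches only the selected logits
```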
### 2. Challenge: Uneven Expert Utilization (Balancing Loss)
**The Problem:**
Left unchecked, the gating network might learn to heavily favor a few experts, leaving others underutilized. This leads to:
* **System Inefficiency:** Overloaded experts become bottlenecks, while underutilized experts waste computational resources.
* **Suboptimal Learning:** Experts might not specialize effectively if they don't receive diverse data.
**Solution: Heuristic Balancing Losses (e.g., from Switch Transformer, Fedus et al. 2022)**
An auxiliary loss is added to the total model loss during training to encourage more even expert usage.
$$ \text{loss}_{\text{auxiliary}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i $$
Where:
* $\alpha$: A hyperparameter controlling the strength of the auxiliary loss.
* $N$: Total number of experts.
* $f_i$: The **fraction of tokens *actually dispatched* to expert $i$** in the current batch $B$, where $T$ is the number of tokens in $B$.
$$ f_i = \frac{1}{T} \sum_{x \in B} \mathbf{1}\{\text{argmax } p(x) = i\} $$
($p(x)$ here refers to the output of the gating network, which could be $s_{i,t}$ in the DeepSeek/classic router. The $\text{argmax}$ means it counts hard assignments to expert $i$.)
* $P_i$: The **fraction of the router *probability mass* allocated to expert $i$** in the current batch $B$.
$$ P_i = \frac{1}{T} \sum_{x \in B} p_i(x) $$
($p_i(x)$ is the learned probability (or soft score) from the gating network for token $x$ and expert $i$.)
**How it works:**
The loss is $\alpha \cdot N \cdot \sum_i f_i \cdot P_i$, and because $f$ and $P$ each sum to $1$ over the experts, it is minimized when routing is uniform, i.e. when each $f_i$ and $P_i$ is close to $1/N$. If an expert $i$ is overused (high $f_i$ and $P_i$), its term in the sum contributes significantly to the loss. Since $f_i$ comes from a hard `argmax` and carries no gradient, the derivative with respect to $p_i(x)$ is proportional to $f_i$: "more frequent use = stronger downweighting," so the gating network is penalized for sending even more traffic to an already busy expert. (A code sketch of this computation follows the next list.)
**Relationship to Gating Network:**
* **$p_i(x)$ (or $s_{i,t}$):** This is the output of the **learned gating network** (e.g., from a linear layer followed by Softmax). The gating network's parameters are updated via gradient descent, influenced by this auxiliary loss.
* **$P_i$:** This is *calculated* from the outputs of the learned gating network for the current batch. It's not a pre-defined value.
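Putting $f_i$ and $P_i$ together, here is a hedged sketch of the auxiliary loss computation (illustrative names; the Switch Transformer codebase differs in implementation details): `f` counts hard `argmax` assignments and carries no gradient, while `P` averages the router probabilities and does.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Auxiliary loss alpha * N * sum_i f_i * P_i over one batch.

    router_logits: (T, N) raw gating scores for T tokens and N experts.
    """
    N = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)             # p_i(x) for every token

    # f_i: fraction of tokens whose hard (argmax) assignment is expert i.
    hard_assign = probs.argmax(dim=-1)                   # (T,) integer expert ids, no gradient
    f = F.one_hot(hard_assign, N).float().mean(dim=0)    # (N,)

    # P_i: fraction of the router's probability mass given to expert i.
    P = probs.mean(dim=0)                                # (N,), carries the gradient

    return alpha * N * (f * P).sum()

logits = torch.randn(1024, 8, requires_grad=True)        # 1024 tokens, 8 experts
aux = load_balancing_loss(logits)
aux.backward()                                            # per-token gradient scales with f_i
print(aux.item())                                         # approaches alpha under uniform routing
```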
**Limitation ("Second Best" Scenario):**
Even with this loss, utilization can remain imbalanced if an expert is consistently the "second best" option (high $P_i$) but never the *absolute top choice* that gets counted in $f_i$ (especially if $K=1$). This is because $f_i$ strictly counts hard assignments based on `argmax`. This limitation highlights why "soft" routing or "softmax after TopK" approaches can be more effective for a truly even distribution.
### 3. Challenge: Overfitting during Fine-tuning
**The Problem:**
Sparse MoE models, despite only activating a few experts per token, possess a very large total number of parameters. When fine-tuning these models on **smaller datasets**, they are highly prone to **overfitting**. The model's vast capacity allows it to memorize the limited fine-tuning data, leading to poor generalization performance on unseen validation data. This is evident when training loss continues to decrease, but validation loss stagnates or increases.
**Solutions:**
* **Zoph et al. solution (fine-tune only the non-MoE parameters):**
* This strategy involves freezing a portion of the MoE model's parameters during fine-tuning, specifically the large expert weights.
* Instead, only the "non-MoE" parameters (e.g., attention layers, adapter layers, or the gating network itself) are updated.
    * This reduces the effective number of trainable parameters during fine-tuning, thereby mitigating the risk of overfitting on small datasets. It assumes the experts are already well pre-trained for general tasks (a minimal freezing sketch follows this list).
* **DeepSeek solution (use lots of data, 1.4M SFT examples):**
* This approach tackles the problem by providing the model with a very large and diverse dataset for Supervised Fine-Tuning (SFT).
* With abundant data (e.g., 1.4 million examples covering a wide range of tasks and languages), the model's large capacity can be effectively utilized for specialized learning rather than memorization. The diversity and volume of data prevent individual experts from overfitting to specific examples.
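As a hedged sketch of the freezing strategy (this is not Zoph et al.'s actual code; it assumes the expert pool lives under submodules named `experts`, as in the `SimpleMoE` sketch earlier, so adjust the match to your own model): mark expert weights as non-trainable and hand only the remaining parameters to the optimizer.

```python
import torch

def freeze_expert_parameters(model: torch.nn.Module):
    """Freeze every parameter that lives under a submodule named 'experts'.

    Returns the remaining (trainable) parameters for the optimizer.
    """
    trainable = []
    for name, param in model.named_parameters():
        if ".experts." in name or name.startswith("experts."):
            param.requires_grad_(False)        # expert weights stay frozen
        else:
            trainable.append(param)            # attention, router, embeddings, ...
    return trainable

# Usage sketch (hypothetical model and hyperparameters):
# optimizer = torch.optim.AdamW(freeze_expert_parameters(model), lr=1e-5)
```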
**Conclusion:**
MoE models offer significant advantages in terms of model capacity and computational efficiency, but their unique sparse activation pattern introduces challenges in training and fine-tuning. Overcoming non-differentiability in routing and ensuring balanced expert utilization are crucial for effective pre-training. During fine-tuning, managing the model's vast parameter count to prevent overfitting on smaller datasets requires either strategic parameter freezing or access to very large and diverse fine-tuning data.
The **Top-K routing** mechanism is a core component in many modern Mixture-of-Experts (MoE) models. It involves selecting a fixed number (`K`) of experts for each input based on relevance scores.
---
**Traditional Top-K (Deterministic Selection)** (contrasted with sampling in a code sketch after this comparison):
* **How it works:**
1. Calculate relevance scores (`s_{i,t}`) for each expert `i` and input `t`.
2. Identify the `K` experts with the highest scores.
3. Experts *within* the Top-K are assigned their scores (`g_{i,t} = s_{i,t}`).
4. Experts *outside* the Top-K are assigned a score of `0` (`g_{i,t} = 0`).
5. The output is a weighted sum of the selected experts' outputs.
* **Pros:** Predictable, deterministic, selects the "best" experts based on current scores.
* **Cons:** Can lead to expert imbalance, where a few popular experts are always chosen, starving others of training.
**Alternative: Sampling from Softmax (Probabilistic Selection):**
* **How it works:**
1. Calculate relevance scores (`s_{i,t}`) which are treated as probabilities (after softmax).
2. **Randomly sample** `K` unique expert indices from the distribution defined by these probabilities.
3. Selected experts contribute; unselected experts do not.
* **Why it's suggested:**
* **Load Balancing:** Prevents expert collapse by ensuring all experts get a chance to be selected, even those with slightly lower scores. This promotes more even training across the entire expert pool.
* **Diversity & Exploration:** Introduces randomness, potentially leading to better generalization and robustness by exploring different expert combinations.
* **Pros:** Better load balancing, prevents expert starvation, encourages exploration.
* **Cons:** Stochastic (non-deterministic routing), can make debugging harder, might not pick the absolute "best" expert in a single instance (but better for long-term training).
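To contrast the two selection rules in code, here is a hedged, illustrative sketch (not any specific model's router): both paths start from the same softmax scores; the deterministic path uses `torch.topk`, while the probabilistic path samples `K` distinct experts with `torch.multinomial`.

```python
import torch

def select_experts(scores: torch.Tensor, k: int, sample: bool = False) -> torch.Tensor:
    """Return (num_tokens, k) expert indices chosen per token.

    scores: (num_tokens, num_experts) router scores s_{i,t} after softmax.
    sample=False -> deterministic Top-K; sample=True -> draw k distinct experts
    with probability proportional to their scores.
    """
    if sample:
        # Probabilistic selection: k unique experts per token, without replacement.
        return torch.multinomial(scores, num_samples=k, replacement=False)
    # Deterministic selection: the k highest-scoring experts per token.
    return scores.topk(k, dim=-1).indices

torch.manual_seed(0)
scores = torch.softmax(torch.randn(4, 8), dim=-1)   # 4 tokens, 8 experts
print(select_experts(scores, k=2))                  # identical every run for fixed scores
print(select_experts(scores, k=2, sample=True))     # varies run to run, spreading load
```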
**Key Takeaway:** While deterministic Top-K is simpler and directly picks the "highest-scoring" experts, sampling from the softmax offers a more robust training dynamic by ensuring that all experts receive training data, thereby preventing some experts from becoming unused ("dead experts").
---

View File

@@ -1,6 +1,6 @@
---
title: "T5 - The Transformer That Zigged When Others Zagged - An Architectural Deep Dive"
date: 2025-08-03T01:47:10
date: 2025-08-03T03:14:20
draft: false
---

Binary file not shown (new image, 1.2 MiB)