From 5253081653490d5afff4ef79d80911098570a3d5 Mon Sep 17 00:00:00 2001
From: Automated Publisher
Date: Sun, 3 Aug 2025 03:49:59 +0000
Subject: [PATCH] =?UTF-8?q?=F0=9F=93=9A=20Auto-publish:=20Add/update=201?=
 =?UTF-8?q?=20blog=20posts?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Generated on: Sun Aug 3 03:49:57 UTC 2025
Source: md-personal repository
---
 ...models-challenges-solutions-in-practice.md | 38 +++++++++----------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/content/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice.md b/content/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice.md
index da96e33..b77508b 100644
--- a/content/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice.md
+++ b/content/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice.md
@@ -9,19 +9,19 @@ Mixture-of-Experts (MoEs) are neural network architectures that allow different
 ### 1. Challenge: Non-Differentiability of Routing Functions

 **The Problem:**
-Many routing mechanisms, especially "Top-K routing," involve a discrete, hard selection process. A common function is `KeepTopK(v, k)`, which selects the top `k` scoring elements from a vector `v` and sets others to $-\infty$ or $0$.
+Many routing mechanisms, especially "Top-K routing," involve a discrete, hard selection process. A common function is `KeepTopK(v, k)`, which selects the top `k` scoring elements from a vector `v` and sets others to \(-\infty\) or \(0\).

-$$
+\[
 KeepTopK(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise.} \end{cases}
-$$
+\]

-This function is **not differentiable**. Its gradient is zero almost everywhere and undefined at the threshold points, making it impossible to directly train the gating network's parameters (e.g., $W_g$) using standard gradient descent.
+This function is **not differentiable**. Its gradient is zero almost everywhere and undefined at the threshold points, making it impossible to directly train the gating network's parameters (e.g., \(W_g\)) using standard gradient descent.

 **Solutions (Stochastic Approximations):**
 To enable end-to-end training, non-differentiable routing decisions must be approximated with differentiable or stochastic methods.

 * **Stochastic Scoring (e.g., Shazeer et al. 2017):**
- The expert score $H(x)_i = (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}((x \cdot W_{noise})_i)$ introduces Gaussian noise. This makes the scores themselves stochastic, which can be leveraged with other methods.
+ The expert score \(H(x)_i = (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}((x \cdot W_{noise})_i)\) introduces Gaussian noise. This makes the scores themselves stochastic, which can be leveraged with other methods.

 * **Gumbel-Softmax Trick (or Concrete Distribution):**
 This method allows for differentiable sampling from categorical distributions. Instead of directly picking the top-k, Gumbel noise is added to the scores, and a Softmax (with a temperature parameter) is applied. This provides a continuous, differentiable approximation of a discrete choice, allowing gradients to flow back.

@@ -33,7 +33,7 @@ To enable end-to-end training, non-differentiable routing decisions must be appr
 A simpler approximation where, during the backward pass, gradients are treated as if the non-differentiable operation was an identity function or a simple smooth function.

 * **Softmax after TopK (e.g., Mixtral, DBRX, DeepSeek v3):**
- Instead of `Softmax(KeepTopK(...))`, some models apply a Softmax *only to the scores of the selected TopK experts*, and then assign $0$ to the rest. This provides differentiable weights for the selected experts while still enforcing sparsity.
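The `KeepTopK` function and the noisy scoring described above can be sketched end to end. This is a minimal NumPy illustration, not any production router: the shapes, the seed, and the mask-then-softmax ordering are assumptions made for the example.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def noisy_topk_gating(x, W_g, W_noise, k, rng):
    """Noisy Top-K gating in the spirit of Shazeer et al. (2017).

    x: (d,) token vector; W_g, W_noise: (d, N) learned matrices.
    Returns (N,) routing weights with exactly k non-zero entries.
    """
    clean = x @ W_g                       # deterministic expert scores
    noise = rng.standard_normal(clean.shape) * softplus(x @ W_noise)
    noisy = clean + noise                 # H(x): stochastic scores
    # KeepTopK: keep the k largest scores, send the rest to -inf.
    topk = np.argsort(noisy)[-k:]
    masked = np.full_like(noisy, -np.inf)
    masked[topk] = noisy[topk]
    # Softmax over the masked scores; exp(-inf) = 0, so this equals
    # applying a Softmax only to the k selected scores and 0 elsewhere.
    z = np.exp(masked - noisy[topk].max())
    return z / z.sum()

rng = np.random.default_rng(0)
d, n_experts, k = 8, 4, 2
x = rng.standard_normal(d)
W_g = rng.standard_normal((d, n_experts))
W_noise = rng.standard_normal((d, n_experts))
weights = noisy_topk_gating(x, W_g, W_noise, k, rng)
print(np.count_nonzero(weights))  # exactly k experts receive non-zero weight
```

Note that in the forward pass `Softmax(KeepTopK(...))` with \(-\infty\) masking and "Softmax after TopK" produce the same weights; the approaches differ in how gradients are routed during training.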
+ Instead of `Softmax(KeepTopK(...))`, some models apply a Softmax *only to the scores of the selected TopK experts*, and then assign \(0\) to the rest. This provides differentiable weights for the selected experts while still enforcing sparsity.

 ### 2. Challenge: Uneven Expert Utilization (Balancing Loss)

@@ -45,27 +45,27 @@ Left unchecked, the gating network might learn to heavily favor a few experts, l
 **Solution: Heuristic Balancing Losses (e.g., from Switch Transformer, Fedus et al. 2022)**
 An auxiliary loss is added to the total model loss during training to encourage more even expert usage.

-$$ \text{loss}_{\text{auxiliary}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i $$
+\[ \text{loss}_{\text{auxiliary}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i \]

 Where:
-* $\alpha$: A hyperparameter controlling the strength of the auxiliary loss.
-* $N$: Total number of experts.
-* $f_i$: The **fraction of tokens *actually dispatched* to expert $i$** in the current batch $B$.
- $$ f_i = \frac{1}{T} \sum_{x \in B} \mathbf{1}\{\text{argmax } p(x) = i\} $$
- ($p(x)$ here refers to the output of the gating network, which could be $s_{i,t}$ in the DeepSeek/classic router. The $\text{argmax}$ means it counts hard assignments to expert $i$.)
-* $P_i$: The **fraction of the router *probability mass* allocated to expert $i$** in the current batch $B$.
- $$ P_i = \frac{1}{T} \sum_{x \in B} p_i(x) $$
- ($p_i(x)$ is the learned probability (or soft score) from the gating network for token $x$ and expert $i$.)
+* \(\alpha\): A hyperparameter controlling the strength of the auxiliary loss.
+* \(N\): Total number of experts.
+* \(f_i\): The **fraction of tokens *actually dispatched* to expert \(i\)** in the current batch \(B\).
+ \[ f_i = \frac{1}{T} \sum_{x \in B} \mathbf{1}\{\text{argmax } p(x) = i\} \]
+ (\(p(x)\) here refers to the output of the gating network, which could be \(s_{i,t}\) in the DeepSeek/classic router. The \(\text{argmax}\) means it counts hard assignments to expert \(i\).)
+* \(P_i\): The **fraction of the router *probability mass* allocated to expert \(i\)** in the current batch \(B\).
+ \[ P_i = \frac{1}{T} \sum_{x \in B} p_i(x) \]
+ (\(p_i(x)\) is the learned probability (or soft score) from the gating network for token \(x\) and expert \(i\).)

 **How it works:**
-The loss aims to minimize the product $f_i \cdot P_i$ when $f_i$ and $P_i$ are small, effectively pushing them to be larger (closer to $1/N$). If an expert $i$ is overused (high $f_i$ and $P_i$), its term in the sum contributes significantly to the loss. The derivative with respect to $p_i(x)$ reveals that "more frequent use = stronger downweighting," meaning the gating network is penalized for sending too much traffic to an already busy expert.
+The loss penalizes the sum of the products \(f_i \cdot P_i\); since the \(f_i\) and the \(P_i\) each sum to 1 across experts, this sum is minimized under uniform routing, where both are close to \(1/N\). If an expert \(i\) is overused (high \(f_i\) and \(P_i\)), its term in the sum contributes significantly to the loss. The derivative with respect to \(p_i(x)\) reveals that "more frequent use = stronger downweighting," meaning the gating network is penalized for sending too much traffic to an already busy expert.

 **Relationship to Gating Network:**
-* **$p_i(x)$ (or $s_{i,t}$):** This is the output of the **learned gating network** (e.g., from a linear layer followed by Softmax). The gating network's parameters are updated via gradient descent, influenced by this auxiliary loss.
-* **$P_i$:** This is *calculated* from the outputs of the learned gating network for the current batch. It's not a pre-defined value.
+* **\(p_i(x)\) (or \(s_{i,t}\)):** This is the output of the **learned gating network** (e.g., from a linear layer followed by Softmax). The gating network's parameters are updated via gradient descent, influenced by this auxiliary loss.
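As a concrete illustration of \(f_i\), \(P_i\), and the auxiliary loss for the \(K=1\) hard-assignment case, here is a small NumPy sketch; the batch of router probabilities and \(\alpha = 0.01\) are invented for the example.

```python
import numpy as np

def load_balancing_loss(probs, alpha=0.01):
    """Switch-style auxiliary balancing loss for hard top-1 routing.

    probs: (T, N) array of router probabilities p_i(x) for T tokens and
    N experts; each row sums to 1. Hard assignment is argmax (K=1).
    """
    T, N = probs.shape
    # f_i: fraction of tokens whose argmax lands on expert i.
    f = np.bincount(probs.argmax(axis=1), minlength=N) / T
    # P_i: average router probability mass given to expert i.
    P = probs.mean(axis=0)
    return alpha * N * np.sum(f * P)

T, N = 8, 4
# Balanced router: each expert is the clear top choice for T/N tokens.
balanced = np.full((T, N), 0.1)
balanced[np.arange(T), np.arange(T) % N] = 0.7
# Collapsed router: all probability mass on expert 0.
collapsed = np.zeros((T, N))
collapsed[:, 0] = 1.0

print(load_balancing_loss(balanced))   # ~alpha, the balanced optimum
print(load_balancing_loss(collapsed))  # ~alpha * N, N times larger
```

Under uniform routing each \(f_i = P_i = 1/N\), so the loss evaluates to \(\alpha\); a fully collapsed router pays \(\alpha \cdot N\).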
+* **\(P_i\):** This is *calculated* from the outputs of the learned gating network for the current batch. It's not a pre-defined value.

 **Limitation ("Second Best" Scenario):**
-Even with this loss, an expert can remain imbalanced if it's consistently the "second best" option (high $P_i$) but never the *absolute top choice* that gets counted in $f_i$ (especially if $K=1$). This is because $f_i$ strictly counts hard assignments based on `argmax`. This limitation highlights why "soft" routing or "softmax after TopK" approaches can be more effective for truly even distribution.
+Even with this loss, an expert can remain imbalanced if it's consistently the "second best" option (high \(P_i\)) but never the *absolute top choice* that gets counted in \(f_i\) (especially if \(K=1\)). This is because \(f_i\) strictly counts hard assignments based on `argmax`. This limitation highlights why "soft" routing or "softmax after TopK" approaches can be more effective for truly even distribution.

 ### 3. Challenge: Overfitting during Fine-tuning