This commit is contained in:
eric
2025-08-03 03:20:19 +00:00
parent 76c539f415
commit 8c3be83b91
14 changed files with 85 additions and 19 deletions


@@ -0,0 +1,47 @@

# Mixture-of-Experts (MoE) Models Challenges & Solutions in Practice

Eric X. Liu's Personal Page · August 3, 2025 · 7-minute read

Mixture-of-Experts (MoEs) are neural network architectures that allow different parts of the model (called "experts") to specialize in different types of inputs. A "gating network" or "router" learns to dispatch each input (or "token") to a subset of these experts. While powerful for scaling models, MoEs introduce several practical challenges.

### 1. Challenge: Non-Differentiability of Routing Functions

**The Problem:**
Many routing mechanisms, especially "Top-K routing," involve a discrete, hard selection process. A common function is `KeepTopK(v, k)`, which selects the top `k` scoring elements from a vector `v` and sets the others to $-\infty$ (or $0$):

$$
KeepTopK(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise.} \end{cases}
$$

This function is **not differentiable**. Its gradient is zero almost everywhere and undefined at the threshold points, making it impossible to directly train the gating network's parameters (e.g., $W_g$) using standard gradient descent.

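To make the zero-gradient issue concrete, here is a minimal PyTorch sketch (an illustration, not code from the post; `keep_top_k` is a hypothetical helper): the masked-out scores receive exactly zero gradient, and the discrete choice of *which* experts made the cut contributes no gradient at all.

```python
import torch

def keep_top_k(v: torch.Tensor, k: int) -> torch.Tensor:
    """KeepTopK(v, k): keep the k largest scores, set the rest to -inf."""
    topk_vals, topk_idx = torch.topk(v, k, dim=-1)
    masked = torch.full_like(v, float("-inf"))
    return masked.scatter(-1, topk_idx, topk_vals)

# Toy gating scores for one token over 4 experts.
scores = torch.tensor([1.2, 0.3, -0.5, 2.1], requires_grad=True)
gates = torch.softmax(keep_top_k(scores, k=2), dim=-1)   # zeros for masked experts

# Weight some dummy expert outputs and backpropagate.
expert_outputs = torch.tensor([1.0, 2.0, 3.0, 4.0])
(gates * expert_outputs).sum().backward()

print(gates)        # non-zero only for the two selected experts
print(scores.grad)  # exactly zero for the experts that were masked out
```
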
**Solutions (Stochastic Approximations):**
To enable end-to-end training, non-differentiable routing decisions must be approximated with differentiable or stochastic methods.

- **Stochastic Scoring (e.g., Shazeer et al. 2017):**
  The expert score $H(x)_i = (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}((x \cdot W_{noise})_i)$ introduces Gaussian noise. This makes the scores themselves stochastic, which can be leveraged by the other methods below.

- **Gumbel-Softmax Trick (or Concrete Distribution):**
  This method allows for differentiable sampling from categorical distributions. Instead of directly picking the top-k, Gumbel noise is added to the scores and a Softmax (with a temperature parameter) is applied. This provides a continuous, differentiable approximation of a discrete choice, allowing gradients to flow back.

- **REINFORCE (Score Function Estimator):**
  A policy-gradient method from reinforcement learning. The routing decision is treated as an action, and the gating network's parameters are updated based on a "reward" (e.g., the model's performance). Gradients are estimated by sampling routing choices and weighting them by their outcomes.

- **Straight-Through Estimator (STE):**
  A simpler approximation where, during the backward pass, gradients are propagated as if the non-differentiable operation were an identity function or a simple smooth function.

- **Softmax after TopK (e.g., Mixtral, DBRX, DeepSeek v3):**
  Instead of `Softmax(KeepTopK(...))`, some models apply a Softmax *only to the scores of the selected TopK experts* and assign $0$ to the rest. This provides differentiable weights for the selected experts while still enforcing sparsity (see the sketch just after this list).

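As a concrete illustration of the last approach, here is a small sketch of "softmax after TopK" gating (my own illustration of the general shape, not any specific model's code): the Softmax is computed only over the K selected scores, so the resulting weights are differentiable while every unselected expert gets an exact zero.

```python
import torch

def softmax_after_topk(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Gate weights via Softmax over the selected Top-K scores only;
    all other experts receive weight 0. scores: (num_tokens, num_experts)."""
    topk_vals, topk_idx = torch.topk(scores, k, dim=-1)
    weights = torch.softmax(topk_vals, dim=-1)      # differentiable w.r.t. selected scores
    gates = torch.zeros_like(scores)
    return gates.scatter(-1, topk_idx, weights)     # sparse; each row sums to 1

router_logits = torch.randn(2, 8)                   # 2 tokens, 8 experts
print(softmax_after_topk(router_logits, k=2))
```
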
### 2. Challenge: Uneven Expert Utilization (Balancing Loss)

**The Problem:**
Left unchecked, the gating network might learn to heavily favor a few experts, leaving others underutilized. This leads to:

- **System Inefficiency:** Overloaded experts become bottlenecks, while underutilized experts waste computational resources.
- **Suboptimal Learning:** Experts might not specialize effectively if they don't receive diverse data.

**Solution: Heuristic Balancing Losses (e.g., from the Switch Transformer, Fedus et al. 2022)**
An auxiliary loss is added to the total model loss during training to encourage more even expert usage.

$$ \text{loss}_{\text{auxiliary}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i $$

Where:

- $\alpha$: A hyperparameter controlling the strength of the auxiliary loss.
- $N$: Total number of experts.
- $f_i$: The **fraction of tokens *actually dispatched* to expert $i$** in the current batch $B$ of $T$ tokens:
  $$ f_i = \frac{1}{T} \sum_{x \in B} \mathbf{1}\{\text{argmax}\, p(x) = i\} $$
  ($p(x)$ here refers to the output of the gating network, which could be $s_{i,t}$ in the DeepSeek/classic router; the $\text{argmax}$ means $f_i$ counts hard assignments to expert $i$.)
- $P_i$: The **fraction of the router *probability mass* allocated to expert $i$** in the current batch $B$:
  $$ P_i = \frac{1}{T} \sum_{x \in B} p_i(x) $$
  ($p_i(x)$ is the learned probability, or soft score, from the gating network for token $x$ and expert $i$.)

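A minimal sketch of how $f_i$, $P_i$, and the auxiliary loss could be computed for one batch (an illustration with hypothetical helper names; $\alpha = 0.01$ is used here as a typical small value):

```python
import torch

def load_balancing_loss(router_probs: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Auxiliary balancing loss: alpha * N * sum_i(f_i * P_i).
    router_probs: (T, N) softmax outputs p_i(x) for T tokens and N experts."""
    T, N = router_probs.shape
    expert_index = router_probs.argmax(dim=-1)                 # hard argmax assignments
    # f_i: fraction of tokens dispatched to expert i -- a hard count, no gradient.
    f = torch.bincount(expert_index, minlength=N).float() / T
    # P_i: mean router probability mass for expert i -- differentiable.
    P = router_probs.mean(dim=0)
    return alpha * N * torch.sum(f * P)

logits = torch.randn(16, 4, requires_grad=True)                # 16 tokens, 4 experts
aux = load_balancing_loss(torch.softmax(logits, dim=-1))
aux.backward()   # gradients flow only through P_i; f_i just weights each expert's term
```
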
**How it works:**
The sum $\sum_i f_i \cdot P_i$ is minimized when expert usage is uniform, i.e. when $f_i \approx P_i \approx 1/N$ for every expert (at which point the loss equals $\alpha$). If an expert $i$ is overused (high $f_i$ and $P_i$), its term contributes heavily to the sum. The derivative with respect to $p_i(x)$ (made explicit just below) shows that "more frequent use = stronger downweighting": the gating network is penalized for sending more probability mass to an already busy expert.

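Treating $f_i$ as a constant (it comes from a hard, non-differentiable count), the gradient of the auxiliary loss with respect to a single router probability $p_i(x)$ is proportional to $f_i$:

$$
\frac{\partial \, \text{loss}_{\text{auxiliary}}}{\partial \, p_i(x)} = \alpha \cdot N \cdot f_i \cdot \frac{\partial P_i}{\partial p_i(x)} = \frac{\alpha \cdot N \cdot f_i}{T}
$$

So the more tokens an expert already receives (larger $f_i$), the more strongly any additional probability mass routed to it is pushed down.
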
**Relationship to the Gating Network:**

- **$p_i(x)$ (or $s_{i,t}$):** The output of the **learned gating network** (e.g., a linear layer followed by a Softmax). The gating network's parameters are updated via gradient descent, influenced by this auxiliary loss.
- **$P_i$:** *Calculated* from the outputs of the learned gating network for the current batch; it is not a pre-defined value.

**Limitation ("Second Best" Scenario):**
Even with this loss, an expert can remain imbalanced if it is consistently the "second best" option (high $P_i$) but never the *absolute top choice* counted in $f_i$ (especially if $K=1$), because $f_i$ strictly counts hard `argmax` assignments. This limitation highlights why "soft" routing or "softmax after TopK" approaches can be more effective for a truly even distribution.

### 3. Challenge: Overfitting during Fine-tuning

**The Problem:**
Sparse MoE models, despite only activating a few experts per token, possess a very large total number of parameters. When fine-tuning these models on **smaller datasets**, they are highly prone to **overfitting**. The model's vast capacity allows it to memorize the limited fine-tuning data, leading to poor generalization on unseen validation data. This is evident when training loss continues to decrease while validation loss stagnates or increases.

**Solutions:**

- **Zoph et al.'s solution: fine-tune the non-MoE MLPs:**
  - This strategy freezes a portion of the MoE model's parameters during fine-tuning, specifically the large expert weights.
  - Instead, only the "non-MoE" parameters (e.g., attention layers, adapter layers, or the gating network itself) are updated (see the sketch after this list).
  - This reduces the effective number of trainable parameters during fine-tuning, mitigating the risk of overfitting on small datasets. It assumes the experts are already well pre-trained for general tasks.

- **DeepSeek's solution: use lots of data (1.4M SFT examples):**
  - This approach tackles the problem by providing the model with a very large and diverse dataset for Supervised Fine-Tuning (SFT).
  - With abundant data (e.g., 1.4 million examples covering a wide range of tasks and languages), the model's large capacity can be used for specialized learning rather than memorization. The diversity and volume of data prevent individual experts from overfitting to specific examples.

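A minimal sketch of the parameter-freezing idea (the name-based filter is an assumption; adjust it to however a given model labels its expert modules):

```python
import torch
import torch.nn as nn

def freeze_moe_experts(model: nn.Module) -> list:
    """Freeze expert weights so fine-tuning only updates non-MoE parameters
    (attention, layer norms, router, etc.). Assumes expert parameters carry
    "expert" in their names. Returns the parameters left trainable."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = "expert" not in name
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Hand only the still-trainable parameters to the optimizer, e.g.:
# optimizer = torch.optim.AdamW(freeze_moe_experts(model), lr=1e-5)
```
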
**Conclusion:**
MoE models offer significant advantages in terms of model capacity and computational efficiency, but their unique sparse activation pattern introduces challenges in training and fine-tuning. Overcoming non-differentiability in routing and ensuring balanced expert utilization are crucial for effective pre-training. During fine-tuning, managing the model's vast parameter count to prevent overfitting on smaller datasets requires either strategic parameter freezing or access to very large and diverse fine-tuning data.

The **Top-K routing** mechanism is a core component in many modern Mixture-of-Experts (MoE) models. It involves selecting a fixed number (`K`) of experts for each input based on relevance scores.

---

**Traditional Top-K (Deterministic Selection):**

- **How it works:**
  1. Calculate relevance scores (`s_{i,t}`) for each expert `i` and input `t`.
  2. Identify the `K` experts with the highest scores.
  3. Experts *within* the Top-K are assigned their scores (`g_{i,t} = s_{i,t}`).
  4. Experts *outside* the Top-K are assigned a score of `0` (`g_{i,t} = 0`).
  5. The output is a weighted sum of the selected experts' outputs.
- **Pros:** Predictable, deterministic, selects the "best" experts based on current scores.
- **Cons:** Can lead to expert imbalance, where a few popular experts are always chosen, starving others of training.

**Alternative: Sampling from the Softmax (Probabilistic Selection):**

- **How it works:**
  1. Calculate relevance scores (`s_{i,t}`), which are treated as probabilities (after a softmax).
  2. **Randomly sample** `K` unique expert indices from the distribution defined by these probabilities.
  3. Selected experts contribute; unselected experts do not.
- **Why it's suggested:**
  - **Load Balancing:** Prevents expert collapse by ensuring all experts get a chance to be selected, even those with slightly lower scores. This promotes more even training across the entire expert pool.
  - **Diversity & Exploration:** Introduces randomness, potentially leading to better generalization and robustness by exploring different expert combinations.
- **Pros:** Better load balancing, prevents expert starvation, encourages exploration.
- **Cons:** Stochastic (non-deterministic) routing can make debugging harder and might not pick the absolute "best" expert in a single instance (though it is better for long-term training).

**Key Takeaway:** While deterministic Top-K is simpler and directly picks the highest-scoring experts, sampling from the softmax offers a more robust training dynamic by ensuring that all experts receive training data, thereby preventing some experts from becoming unused ("dead experts"). A minimal sampling sketch follows.

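A small sketch of the sampling alternative (an illustration; renormalizing the sampled scores into gate weights is my assumption, since the description above only specifies which experts contribute):

```python
import torch

def sample_k_experts(router_logits: torch.Tensor, k: int) -> torch.Tensor:
    """Sample k distinct experts per token from softmax(logits) instead of
    taking a deterministic Top-K. router_logits: (num_tokens, num_experts)."""
    probs = torch.softmax(router_logits, dim=-1)
    idx = torch.multinomial(probs, num_samples=k, replacement=False)   # random, per token
    picked = probs.gather(-1, idx)
    gates = torch.zeros_like(probs)
    # Renormalize the sampled probabilities so each token's gate weights sum to 1.
    return gates.scatter(-1, idx, picked / picked.sum(-1, keepdim=True))

router_logits = torch.randn(4, 8)
print(sample_k_experts(router_logits, k=2))   # different experts can be chosen on each call
```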