This commit is contained in:
eric
2025-08-03 06:15:25 +00:00
parent a50fee0dcf
commit a9192dd7da
14 changed files with 37 additions and 37 deletions


July 2, 2025 · 7-minute read

Mixture-of-Experts (MoEs) are neural network architectures that allow different parts of the model (called "experts") to specialize in different types of inputs. A "gating network" or "router" learns to dispatch each input (or "token") to a subset of these experts. While powerful for scaling models, MoEs introduce several practical challenges.

### 1. Challenge: Non-Differentiability of Routing Functions

**The Problem:**
Many routing mechanisms, especially "Top-K routing," involve a discrete, hard selection process. A common function is `KeepTopK(v, k)`, which selects the top `k` scoring elements from a vector `v` and sets the others to $-\infty$ or $0$.

$$
\text{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise.} \end{cases}
$$

This function is **not differentiable**. Its gradient is zero almost everywhere and undefined at the threshold points, making it impossible to directly train the gating network's parameters (e.g., $W_g$) using standard gradient descent.
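To see why the hard cut blocks learning, here is a minimal PyTorch sketch (not from the original post; the 4-expert scores and the stand-in expert outputs are made up for illustration): the dropped experts receive exactly zero gradient, and the top-k boundary itself contributes no gradient signal.

```python
import torch

def keep_top_k(v: torch.Tensor, k: int) -> torch.Tensor:
    """Hard Top-K: keep the k largest scores, set all other positions to -inf."""
    topk_vals, topk_idx = torch.topk(v, k, dim=-1)
    out = torch.full_like(v, float("-inf"))
    return out.scatter(-1, topk_idx, topk_vals)

# Toy router scores for one token over 4 experts (made-up values).
scores = torch.tensor([1.0, 3.0, 2.0, 0.5], requires_grad=True)
gates = torch.softmax(keep_top_k(scores, k=2), dim=-1)   # non-selected experts get weight 0
expert_outputs = torch.tensor([10.0, 20.0, 30.0, 40.0])  # stand-in scalar outputs per expert
loss = (gates * expert_outputs).sum()
loss.backward()
print(gates)        # only experts 1 and 2 have non-zero weights
print(scores.grad)  # zero for the dropped experts; the selection step passes no gradient
```

The gradients that do flow only adjust the relative weights of the already-selected experts, which is why the stochastic approximations below are needed.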
**Solutions (Stochastic Approximations):**
To enable end-to-end training, non-differentiable routing decisions must be approximated with differentiable or stochastic methods.

- **Stochastic Scoring (e.g., Shazeer et al. 2017):**
  The expert score $H(x)_i = (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}((x \cdot W_{noise})_i)$ introduces Gaussian noise. This makes the scores themselves stochastic, which can be leveraged by the other methods (see the sketch after this list).

- **Gumbel-Softmax Trick (or Concrete Distribution):**
  This method allows for differentiable sampling from categorical distributions. Instead of directly picking the top-k, Gumbel noise is added to the scores, and a Softmax (with a temperature parameter) is applied. This provides a continuous, differentiable approximation of a discrete choice, allowing gradients to flow back.

- **REINFORCE (Score Function Estimator):**
  This is a policy gradient method from reinforcement learning. The routing decision is treated as an action, and the gating network's parameters are updated based on the "reward" (e.g., the model's performance). Gradients are estimated by sampling routing choices and weighting them by their outcomes.

- **Straight-Through Estimator (STE):**
  A simpler approximation where, during the backward pass, gradients are treated as if the non-differentiable operation were an identity function or a simple smooth function.

- **Softmax after TopK (e.g., Mixtral, DBRX, DeepSeek v3):**
  Instead of `Softmax(KeepTopK(...))`, some models apply a Softmax *only to the scores of the selected TopK experts*, and then assign $0$ to the rest. This provides differentiable weights for the selected experts while still enforcing sparsity.
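As a rough illustration of three of the options above, the sketch below (my own, not the post's code) combines Shazeer-style noisy scoring, a Gumbel-Softmax relaxation, and "softmax after TopK" weighting; the token batch, hidden size, and the `W_g` / `W_noise` matrices are arbitrary stand-ins.

```python
import torch
import torch.nn.functional as F

def noisy_scores(x, W_g, W_noise):
    """Shazeer et al. (2017)-style scoring: H(x) = x @ W_g + N(0,1) * Softplus(x @ W_noise)."""
    clean = x @ W_g
    noise_scale = F.softplus(x @ W_noise)
    return clean + torch.randn_like(clean) * noise_scale

def gumbel_softmax_weights(scores, tau=1.0):
    """Gumbel-Softmax relaxation: a soft, differentiable stand-in for a hard categorical pick."""
    return F.gumbel_softmax(scores, tau=tau, hard=False)

def softmax_after_topk(scores, k):
    """Mixtral/DBRX/DeepSeek-v3 style: softmax over the selected experts only, zeros elsewhere."""
    topk_vals, topk_idx = torch.topk(scores, k, dim=-1)
    weights = torch.zeros_like(scores)
    return weights.scatter(-1, topk_idx, torch.softmax(topk_vals, dim=-1))

# Toy setup: 3 tokens, hidden size 8, 4 experts (all dimensions made up for illustration).
torch.manual_seed(0)
x = torch.randn(3, 8)
W_g = torch.randn(8, 4, requires_grad=True)
W_noise = torch.randn(8, 4, requires_grad=True)
scores = noisy_scores(x, W_g, W_noise)
print(gumbel_softmax_weights(scores))   # dense, differentiable weights over all experts
print(softmax_after_topk(scores, k=2))  # sparse: two non-zero weights per token, summing to 1
```

With `hard=True`, `F.gumbel_softmax` instead returns one-hot selections while still passing gradients through the soft sample, which connects it to the straight-through estimator item above.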
### 2. Challenge: Uneven Expert Utilization (Balancing Loss)

**The Problem:**
Left unchecked, the gating network might learn to heavily favor a few experts, leaving others underutilized. This leads to:

- **System Inefficiency:** Overloaded experts become bottlenecks, while underutilized experts waste computational resources.
- **Suboptimal Learning:** Experts might not specialize effectively if they don't receive diverse data.

**Solution: Heuristic Balancing Losses (e.g., from the Switch Transformer, Fedus et al. 2022)**
An auxiliary loss is added to the total model loss during training to encourage more even expert usage.

$$ \text{loss}_{\text{auxiliary}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i $$

Where:

- $\alpha$: A hyperparameter controlling the strength of the auxiliary loss.
- $N$: Total number of experts.
- $f_i$: The **fraction of tokens *actually dispatched* to expert $i$** in the current batch $B$:
  $$ f_i = \frac{1}{T} \sum_{x \in B} \mathbf{1}\{\text{argmax } p(x) = i\} $$
  ($p(x)$ here refers to the output of the gating network, which could be $s_{i,t}$ in the DeepSeek/classic router. The $\text{argmax}$ means it counts hard assignments to expert $i$.)
- $P_i$: The **fraction of the router *probability mass* allocated to expert $i$** in the current batch $B$:
  $$ P_i = \frac{1}{T} \sum_{x \in B} p_i(x) $$
  ($p_i(x)$ is the learned probability (or soft score) from the gating network for token $x$ and expert $i$.)

**How it works:**
Because $\sum_i f_i = \sum_i P_i = 1$, the sum $\sum_i f_i \cdot P_i$ is minimized when both distributions are uniform, i.e., when each expert receives roughly $1/N$ of the hard assignments and $1/N$ of the probability mass. If an expert $i$ is overused (high $f_i$ and $P_i$), its term contributes heavily to the loss. The derivative with respect to $p_i(x)$ shows that "more frequent use = stronger downweighting": the gating network is penalized for sending more probability mass to an already busy expert.

**Relationship to Gating Network:**

- **$p_i(x)$ (or $s_{i,t}$):** This is the output of the **learned gating network** (e.g., a linear layer followed by Softmax). The gating network's parameters are updated via gradient descent, influenced by this auxiliary loss.
- **$P_i$:** This is *calculated* from the outputs of the learned gating network for the current batch. It is not a pre-defined value.

**Limitation ("Second Best" Scenario):**
Even with this loss, an expert can remain imbalanced if it is consistently the "second best" option (high $P_i$) but never the *absolute top choice* counted in $f_i$ (especially if $K=1$), because $f_i$ strictly counts hard assignments based on `argmax`. This limitation highlights why "soft" routing or "softmax after TopK" approaches can be more effective for truly even distribution.
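A small sketch of this balancing loss, assuming a $(T, N)$ matrix of router probabilities and top-1 hard assignments as in the Switch-Transformer-style formula above (the batch of random probabilities is a stand-in):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """alpha * N * sum_i f_i * P_i for router_probs of shape (T tokens, N experts)."""
    T, N = router_probs.shape
    # f_i: fraction of tokens whose hard (argmax, top-1) assignment is expert i. Not differentiable.
    hard_assign = F.one_hot(router_probs.argmax(dim=-1), num_classes=N).float()
    f = hard_assign.mean(dim=0)
    # P_i: mean probability mass the router gives expert i. Differentiable, so gradients flow here,
    # scaled by f_i -- busier experts are downweighted more strongly.
    P = router_probs.mean(dim=0)
    return alpha * N * torch.sum(f * P)

# Stand-in router outputs: 16 tokens, 4 experts.
probs = torch.softmax(torch.randn(16, 4), dim=-1)
print(load_balancing_loss(probs))
```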
### 3. Challenge: Overfitting during Fine-tuning

**The Problem:**
Sparse MoE models, despite only activating a few experts per token, possess a very large total number of parameters. When fine-tuning these models on **smaller datasets**, they are highly prone to **overfitting**. The model's vast capacity allows it to memorize the limited fine-tuning data, leading to poor generalization on unseen validation data. This is evident when training loss continues to decrease while validation loss stagnates or increases.

**Solutions:**

- **Zoph et al. solution: fine-tune the non-MoE MLPs** (see the sketch after this list).
  - This strategy freezes a portion of the MoE model's parameters during fine-tuning, specifically the large expert weights.
  - Only the "non-MoE" parameters (e.g., attention layers, adapter layers, or the gating network itself) are updated.
  - This reduces the effective number of trainable parameters during fine-tuning, thereby mitigating the risk of overfitting on small datasets. It assumes the experts are already well pre-trained for general tasks.

- **DeepSeek solution: use lots of data (1.4M SFT examples).**
  - This approach tackles the problem by providing the model with a very large and diverse dataset for Supervised Fine-Tuning (SFT).
  - With abundant data (e.g., 1.4 million examples covering a wide range of tasks and languages), the model's large capacity can be used for specialized learning rather than memorization. The diversity and volume of the data prevent individual experts from overfitting to specific examples.
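A minimal PyTorch sketch of the parameter-freezing idea, assuming a hypothetical naming convention in which expert FFN weights live under submodules named `experts` (that convention is my assumption, not something specified in the post):

```python
import torch.nn as nn

def freeze_experts_for_finetuning(model: nn.Module) -> int:
    """Freeze MoE expert weights so only non-MoE parameters (attention, router, etc.) train."""
    trainable = 0
    for name, param in model.named_parameters():
        # Hypothetical convention: expert weights are registered under submodules named "experts".
        param.requires_grad = ".experts." not in name
        trainable += param.numel() if param.requires_grad else 0
    return trainable  # number of parameters that remain trainable
```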
**Conclusion:**
MoE models offer significant advantages in terms of model capacity and computational efficiency, but their unique sparse activation pattern introduces challenges in training and fine-tuning. Overcoming non-differentiability in routing and ensuring balanced expert utilization are crucial for effective pre-training. During fine-tuning, managing the model's vast parameter count to prevent overfitting on smaller datasets requires either strategic parameter freezing or access to very large and diverse fine-tuning data.
The **Top-K routing** mechanism, as illustrated in the provided image, is a core component in many modern Mixture-of-Experts (MoE) models. It involves selecting a fixed number (`K`) of experts for each input based on relevance scores.

---

**Traditional Top-K (Deterministic Selection):**

- **How it works:**
  1. Calculate relevance scores (`s_{i,t}`) for each expert `i` and input `t`.
  2. Identify the `K` experts with the highest scores.
  3. Experts *within* the Top-K are assigned their scores (`g_{i,t} = s_{i,t}`).
  4. Experts *outside* the Top-K are assigned a score of `0` (`g_{i,t} = 0`).
  5. The output is a weighted sum of the selected experts' outputs.
- **Pros:** Predictable, deterministic, selects the "best" experts based on current scores.
- **Cons:** Can lead to expert imbalance, where a few popular experts are always chosen, starving others of training.

**Alternative: Sampling from Softmax (Probabilistic Selection):**

- **How it works:**
  1. Calculate relevance scores (`s_{i,t}`), which are treated as probabilities (after softmax).
  2. **Randomly sample** `K` unique expert indices from the distribution defined by these probabilities.
  3. Selected experts contribute; unselected experts do not.
- **Why it's suggested:**
  - **Load Balancing:** Prevents expert collapse by ensuring all experts get a chance to be selected, even those with slightly lower scores. This promotes more even training across the entire expert pool.
  - **Diversity & Exploration:** Introduces randomness, potentially leading to better generalization and robustness by exploring different expert combinations.
- **Pros:** Better load balancing, prevents expert starvation, encourages exploration.
- **Cons:** Stochastic (non-deterministic) routing can make debugging harder and might not pick the absolute "best" expert for a single input (though it is better for long-term training).

**Key Takeaway:** While deterministic Top-K is simpler and directly picks the "highest-scoring" experts, sampling from the softmax offers a more robust training dynamic by ensuring that all experts receive training data, thereby preventing some experts from becoming unused ("dead experts"). A minimal sketch comparing the two selection strategies follows below.
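A small sketch contrasting the two strategies (my own illustration; the scores are made up):

```python
import torch

def topk_select(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Deterministic routing: always the k highest-scoring experts."""
    return torch.topk(scores, k, dim=-1).indices

def sampled_select(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Probabilistic routing: sample k distinct experts from softmax(scores)."""
    probs = torch.softmax(scores, dim=-1)
    return torch.multinomial(probs, k, replacement=False)

# One token's scores over 4 experts (made-up numbers).
s = torch.tensor([[2.0, 1.9, 0.5, -1.0]])
print(topk_select(s, 2))         # always experts 0 and 1
for _ in range(3):
    print(sampled_select(s, 2))  # usually 0/1, but experts 2 and 3 still get occasional traffic
```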