From 95df119b6d171467c16ee873ad582bab4c7be452 Mon Sep 17 00:00:00 2001 From: eric Date: Sun, 3 Aug 2025 04:41:31 +0000 Subject: [PATCH] deploy: fd19c595b64f23b1b40978e09bc33c5aa60aa16f --- 404.html | 2 +- about/index.html | 2 +- categories/index.html | 2 +- index.html | 2 +- index.xml | 7 +++-- .../index.html | 2 +- .../index.html | 23 ++++++++++++++ posts/index.html | 5 ++-- posts/index.xml | 7 +++-- .../index.html | 30 +++++++++---------- .../index.html | 2 +- posts/useful/index.html | 2 +- sitemap.xml | 2 +- tags/index.html | 2 +- 14 files changed, 58 insertions(+), 32 deletions(-) create mode 100644 posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/index.html diff --git a/404.html b/404.html index 7115a35..e7071f9 100644 --- a/404.html +++ b/404.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[f90b459] \ No newline at end of file +[fd19c59] \ No newline at end of file diff --git a/about/index.html b/about/index.html index 39a7e83..8987bb9 100644 --- a/about/index.html +++ b/about/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[f90b459] \ No newline at end of file +[fd19c59] \ No newline at end of file diff --git a/categories/index.html b/categories/index.html index a7aa821..a9442ff 100644 --- a/categories/index.html +++ b/categories/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[f90b459] \ No newline at end of file +[fd19c59] \ No newline at end of file diff --git a/index.html b/index.html index cc71cd3..8665c9d 100644 --- a/index.html +++ b/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[f90b459] \ No newline at end of file +[fd19c59] \ No newline at end of file diff --git a/index.xml b/index.xml index 116d8b2..d8dec86 100644 --- a/index.xml +++ b/index.xml @@ -1,4 +1,4 @@ -Eric X. Liu's Personal Page/Recent content on Eric X. Liu's Personal PageHugoenSun, 03 Aug 2025 03:41:10 +0000A Deep Dive into PPO for Language Models/posts/a-deep-dive-into-ppo-for-language-models/Sat, 02 Aug 2025 00:00:00 +0000/posts/a-deep-dive-into-ppo-for-language-models/<p>Large Language Models (LLMs) have demonstrated astonishing capabilities, but out-of-the-box, they are simply powerful text predictors. They don&rsquo;t inherently understand what makes a response helpful, harmless, or aligned with human values. The technique that has proven most effective at bridging this gap is Reinforcement Learning from Human Feedback (RLHF), and at its heart lies a powerful algorithm: Proximal Policy Optimization (PPO).</p> +Eric X. Liu's Personal Page/Recent content on Eric X. Liu's Personal PageHugoenSun, 03 Aug 2025 04:20:20 +0000A Deep Dive into PPO for Language Models/posts/a-deep-dive-into-ppo-for-language-models/Sat, 02 Aug 2025 00:00:00 +0000/posts/a-deep-dive-into-ppo-for-language-models/<p>Large Language Models (LLMs) have demonstrated astonishing capabilities, but out-of-the-box, they are simply powerful text predictors. They don&rsquo;t inherently understand what makes a response helpful, harmless, or aligned with human values. The technique that has proven most effective at bridging this gap is Reinforcement Learning from Human Feedback (RLHF), and at its heart lies a powerful algorithm: Proximal Policy Optimization (PPO).</p> <p>You may have seen diagrams like the one below, which outlines the RLHF training process. 
It can look intimidating, with a web of interconnected models, losses, and data flows.</p>Mixture-of-Experts (MoE) Models Challenges & Solutions in Practice/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/Wed, 02 Jul 2025 00:00:00 +0000/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/<p>Mixture-of-Experts (MoEs) are neural network architectures that allow different parts of the model (called &ldquo;experts&rdquo;) to specialize in different types of inputs. A &ldquo;gating network&rdquo; or &ldquo;router&rdquo; learns to dispatch each input (or &ldquo;token&rdquo;) to a subset of these experts. While powerful for scaling models, MoEs introduce several practical challenges.</p> <h3 id="1-challenge-non-differentiability-of-routing-functions"> 1. Challenge: Non-Differentiability of Routing Functions @@ -8,8 +8,9 @@ </a> </h3> <p><strong>The Problem:</strong> -Many routing mechanisms, especially &ldquo;Top-K routing,&rdquo; involve a discrete, hard selection process. A common function is <code>KeepTopK(v, k)</code>, which selects the top <code>k</code> scoring elements from a vector <code>v</code> and sets others to $-\infty$ or $0$.</p>An Architectural Deep Dive of T5/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/Sun, 01 Jun 2025 00:00:00 +0000/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/<p>In the rapidly evolving landscape of Large Language Models, a few key architectures define the dominant paradigms. Today, the &ldquo;decoder-only&rdquo; model, popularized by the GPT series and its successors like LLaMA and Mistral, reigns supreme. These models are scaled to incredible sizes and excel at in-context learning.</p> -<p>But to truly understand the field, we must look at the pivotal models that explored different paths. Google&rsquo;s T5, or <strong>Text-to-Text Transfer Transformer</strong>, stands out as one of the most influential. It didn&rsquo;t just introduce a new model; it proposed a new philosophy. This article dives deep into the architecture of T5, how it fundamentally differs from modern LLMs, and the lasting legacy of its unique design choices.</p>Some useful files/posts/useful/Mon, 26 Oct 2020 04:14:43 +0000/posts/useful/<ul> +Many routing mechanisms, especially &ldquo;Top-K routing,&rdquo; involve a discrete, hard selection process. A common function is <code>KeepTopK(v, k)</code>, which selects the top <code>k</code> scoring elements from a vector <code>v</code> and sets others to $-\infty$ or $0$.</p>An Architectural Deep Dive of T5/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/Sun, 01 Jun 2025 00:00:00 +0000/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/<p>In the rapidly evolving landscape of Large Language Models, a few key architectures define the dominant paradigms. Today, the &ldquo;decoder-only&rdquo; model, popularized by the GPT series and its successors like LLaMA and Mistral, reigns supreme. These models are scaled to incredible sizes and excel at in-context learning.</p> <p>But to truly understand the field, we must look at the pivotal models that explored different paths. Google&rsquo;s T5, or <strong>Text-to-Text Transfer Transformer</strong>, stands out as one of the most influential. It didn&rsquo;t just introduce a new model; it proposed a new philosophy.
This article dives deep into the architecture of T5, how it fundamentally differs from modern LLMs, and the lasting legacy of its unique design choices.</p>Mastering Your Breville Barista Pro: The Ultimate Guide to Dialing In Espresso/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/Thu, 01 May 2025 00:00:00 +0000/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/<p>Are you ready to transform your home espresso game from good to genuinely great? The Breville Barista Pro is a fantastic machine, but unlocking its full potential requires understanding a few key principles. This guide will walk you through the systematic process of dialing in your espresso, ensuring every shot is delicious and repeatable.</p> +<p>Our overarching philosophy is simple: <strong>isolate and change only one variable at a time.</strong> While numbers are crucial, your palate is the ultimate judge. Dose, ratio, and time are interconnected, but your <strong>grind size</strong> is your most powerful lever.</p>Some useful files/posts/useful/Mon, 26 Oct 2020 04:14:43 +0000/posts/useful/<ul> <li><a href="https://ericxliu.me/rootCA.pem" class="external-link" target="_blank" rel="noopener">rootCA.pem</a></li> <li><a href="https://ericxliu.me/vpnclient.ovpn" class="external-link" target="_blank" rel="noopener">vpnclient.ovpn</a></li> </ul>About/about/Fri, 01 Jun 2018 07:13:52 +0000/about/ \ No newline at end of file diff --git a/posts/a-deep-dive-into-ppo-for-language-models/index.html b/posts/a-deep-dive-into-ppo-for-language-models/index.html index 4e5ff5b..4d07b36 100644 --- a/posts/a-deep-dive-into-ppo-for-language-models/index.html +++ b/posts/a-deep-dive-into-ppo-for-language-models/index.html @@ -23,4 +23,4 @@ where δ_t = r_t + γV(s_{t+1}) - V(s_t)

  • γ (gam 2016 - 2025 Eric X. Liu -[f90b459] \ No newline at end of file +[fd19c59] \ No newline at end of file diff --git a/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/index.html b/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/index.html new file mode 100644 index 0000000..905c436 --- /dev/null +++ b/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/index.html @@ -0,0 +1,23 @@ +Mastering Your Breville Barista Pro: The Ultimate Guide to Dialing In Espresso · Eric X. Liu's Personal Page

    Mastering Your Breville Barista Pro: The Ultimate Guide to Dialing In Espresso

    Are you ready to transform your home espresso game from good to genuinely great? The Breville Barista Pro is a fantastic machine, but unlocking its full potential requires understanding a few key principles. This guide will walk you through the systematic process of dialing in your espresso, ensuring every shot is delicious and repeatable.

    Our overarching philosophy is simple: isolate and change only one variable at a time. While numbers are crucial, your palate is the ultimate judge. Dose, ratio, and time are interconnected, but your grind size is your most powerful lever.

    Let’s dive in!


    Part 1: The Foundation — Dose (The Weight of Dry Coffee)

    Your dose is the bedrock of your espresso. It’s the weight of your ground coffee, and it should be the first variable you set and then keep constant during the initial dialing-in process.

    Why Dose Matters:

    • Basket Size is Key: Your portafilter basket dictates your ideal dose. Too little coffee (under-dosing) creates excessive “headspace,” leading to soupy extractions. Too much (over-dosing) causes the coffee puck to touch the shower screen, preventing even water flow and causing channeling.
    • Extraction “Work”: A higher dose means more coffee mass, requiring more “work” (a finer grind, more water) to extract properly.
    • Coffee Type:
      • Light Roasts: Denser and harder to extract. Consider a slightly lower dose.
      • Dark Roasts: More brittle and soluble. You can often use a slightly higher dose.

    Application for Your Breville Barista Pro (54mm Portafilter):

    • Your Starting Point: Always begin with 18 grams. Use a scale for accuracy!
    • Adjusting for Roast: For light roasts, if you’re struggling, drop to 17g. For dark roasts, you can try 19g (see the sketch after this list).
    • Golden Rule: Once you choose your starting dose (e.g., 18g), do not change it until you’ve dialed in your grind size.
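
    To make the starting numbers concrete, here is a minimal Python sketch of the dose heuristic above; the function name and roast labels are illustrative shorthand, not part of the guide:

        def starting_dose_g(roast: str = "medium") -> int:
            """Always begin at 18 g; shift by 1 g only if the roast demands it."""
            adjustment = {"light": -1, "dark": +1}  # 17 g / 19 g fallbacks per the guide
            return 18 + adjustment.get(roast, 0)

        print(starting_dose_g("light"))  # 17 -> only if an 18 g light-roast dose is struggling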

    Part 2: Defining the Drink — Brew Ratio (Dose vs. Yield)

    The brew ratio defines the relationship between your dry coffee dose and the weight of your liquid espresso yield. Always measure by weight (grams), not volume (mL), as crema can be inconsistent.

    Understanding Ratios:

    • Ristretto (1:1 – 1:1.5): E.g., 18g in → 18g to 27g out. Strong, textured, less extracted.
    • Espresso (Normale) (1:1.5 – 1:2.5): E.g., 18g in → 27g to 45g out. The standard, balanced shot.
    • Lungo (1:2.5+): E.g., 18g in → 45g+ out. Weaker, less textured, more extracted.

    The Fundamental Trade-Off:

    • Longer Ratio (more water): Higher extraction, but lower strength (more diluted).
    • Shorter Ratio (less water): Lower extraction, but higher strength (more concentrated).

    Application for Your Breville Barista Pro:

    • Recommended Starting Ratio: A 1:2 ratio is the perfect place to begin.
    • Practical Numbers: With your 18g dose, your target yield is 36 grams of liquid espresso (see the sketch after this list).
    • Execution: Place your cup on a scale and use the manual brew function to stop the shot precisely when the scale reads 36g.
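
    Since yield is simply dose times ratio, a small sketch (assuming the 1:2 default above; names are illustrative) makes the arithmetic explicit:

        def target_yield_g(dose_g: float, ratio: float = 2.0) -> float:
            """Weight of liquid espresso to stop the shot at (dose x ratio)."""
            return round(dose_g * ratio, 1)

        print(target_yield_g(18))        # 36.0 g -> the guide's 1:2 starting recipe
        print(target_yield_g(18, 1.25))  # 22.5 g -> a ristretto-leaning shot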

    Part 3: The Diagnostic Tool — Brew Time

    Brew time is not something you set directly; it’s the result of how much resistance your coffee puck provides against the machine’s water pressure. Think of it as a diagnostic tool.

    The 25-30 Second Guideline:

    This is a benchmark. If your 1:2 ratio shot falls within this time, your grind size is likely in the correct range for a balanced extraction.

    • Too Fast (<25s): Indicates under-extraction (often tastes sour).
    • Too Slow (>30s): Indicates over-extraction (often tastes bitter).

    Taste is King: Remember, if a shot tastes fantastic at 32 seconds, it’s a great shot! The time simply becomes part of your successful recipe for that specific coffee.
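
    As a rough sketch, the 25-30 second guideline can be written as a simple decision rule; the thresholds are the benchmarks above, and your palate still overrides the output:

        def diagnose(brew_time_s: float) -> str:
            if brew_time_s < 25:
                return "fast -> likely under-extracted (sour): grind finer"
            if brew_time_s > 30:
                return "slow -> likely over-extracted (bitter): grind coarser"
            return "in range: judge by taste"

        print(diagnose(21))  # fast -> likely under-extracted (sour): grind finer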

    Application for Your Breville Barista Pro:

    • Pre-infusion: The Barista Pro’s low-pressure pre-infusion is part of your total brew time. Its purpose is to saturate the puck evenly to prevent channeling. Keep it consistent for every shot while dialing in.

    Part 4: The Primary Control — Grind Setting

    This is where the magic (and sometimes frustration) happens. Grind size is your main tool for controlling the resistance of the coffee puck, which directly dictates your brew time.

    The Dual Impact of Grinding Finer:

    1. Increases surface area: Allows for more efficient flavor extraction.
    2. Increases resistance: Slows down water flow and increases contact time.

    The Risk of Grinding Too Fine (Channeling):

    If the grind is too fine, the puck becomes so dense that high-pressure water can’t flow evenly. Instead, it “breaks” the puck and punches an easy path (a channel) through a weak spot. This results in a disastrous shot that is simultaneously:

    • Under-extracted: Most of the coffee is bypassed.
    • Over-extracted: The water that does flow blasts through the channel, extracting harsh, bitter compounds.
    • The Taste: A channeled shot tastes hollow, weak, sour, and bitter all at once.

    The Goal: You want to grind as fine as you possibly can without causing significant channeling. This is the sweet spot for maximizing surface area and resistance for high, even extraction.

    Grind Retention (Purging): Most grinders retain some old grounds. When you change your grind setting, always purge a few grams of coffee to ensure your dose is entirely at the new setting.

    Application for Your Breville Barista Pro:

    • Grinder Mechanism: The “Grind Amount” dial controls the TIME the grinder runs, not the weight. When you adjust the fineness, you must re-adjust the grind time to ensure you are still getting your target 18g dose.
    • Tackling Channeling: The Barista Pro is prone to channeling. To fight this, focus on excellent puck prep: use a WDT (Weiss Distribution Technique) tool to break up clumps and evenly distribute the grounds before tamping levelly.

    The Complete Dialing-In Workflow

    This systematic process will get you to a delicious shot from your Breville Barista Pro efficiently (a short code sketch after the list recaps the loop):

    1. Set Your Constants:
      • Dose: 18g.
      • Ratio: 1:2 (meaning a Yield of 36g).
      • Pre-infusion: Use a consistent method (e.g., manual 8-second hold).
    2. Make an Initial Grind:
      • Set the grinder to a starting point of 15.
      • Adjust the grind time until the grinder dispenses exactly 18g.
    3. Pull the First Shot:
      • Brew manually, stopping at 36g of liquid in the cup. Note the total brew time.
    4. Taste and Diagnose:
      • Fast & Sour? (<25s): Grind is too coarse.
      • Slow & Bitter? (>32s): Grind is too fine.
    5. Make ONE Adjustment - THE GRIND SIZE:
      • If fast/sour, adjust the grind finer (e.g., from 15 down to 13).
      • If slow/bitter, adjust the grind coarser (e.g., from 15 up to 17).
    6. Re-adjust and Repeat:
      • After changing the grind setting, purge a small amount of coffee.
      • Re-weigh your next dose and adjust the grind time to get back to exactly 18g.
      • Pull another 36g shot. Repeat this process until your shot tastes balanced and the time falls roughly between 25-32 seconds.
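
    A minimal Python sketch of this loop, assuming the fixed 18g dose, the 36g yield, and a hypothetical pull_shot() stand-in for brewing and timing a shot:

        DOSE_G, YIELD_G = 18.0, 36.0

        def dial_in(grind_setting: int, pull_shot) -> int:
            """Iterate the grind setting until the shot lands in the target window."""
            for _ in range(10):                  # safety cap on attempts
                brew_time_s = pull_shot(DOSE_G, YIELD_G, grind_setting)
                if brew_time_s < 25:
                    grind_setting -= 2           # fast/sour -> finer (e.g., 15 down to 13)
                elif brew_time_s > 32:
                    grind_setting += 2           # slow/bitter -> coarser (e.g., 15 up to 17)
                else:
                    return grind_setting         # balanced; confirm by taste
                # After each change: purge, re-weigh, and re-adjust grind time to 18g.
            return grind_setting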

    Happy brewing! With patience and this systematic approach, you’ll be pulling consistently delicious espresso shots from your Breville Barista Pro in no time.

    \ No newline at end of file diff --git a/posts/index.html b/posts/index.html index 40f0471..309a23e 100644 --- a/posts/index.html +++ b/posts/index.html @@ -3,9 +3,10 @@ \ No newline at end of file +[fd19c59] \ No newline at end of file diff --git a/posts/index.xml b/posts/index.xml index 143dc20..1876b5e 100644 --- a/posts/index.xml +++ b/posts/index.xml @@ -1,4 +1,4 @@ -Posts on Eric X. Liu's Personal Page/posts/Recent content in Posts on Eric X. Liu's Personal PageHugoenSun, 03 Aug 2025 03:41:10 +0000A Deep Dive into PPO for Language Models/posts/a-deep-dive-into-ppo-for-language-models/Sat, 02 Aug 2025 00:00:00 +0000/posts/a-deep-dive-into-ppo-for-language-models/<p>Large Language Models (LLMs) have demonstrated astonishing capabilities, but out-of-the-box, they are simply powerful text predictors. They don&rsquo;t inherently understand what makes a response helpful, harmless, or aligned with human values. The technique that has proven most effective at bridging this gap is Reinforcement Learning from Human Feedback (RLHF), and at its heart lies a powerful algorithm: Proximal Policy Optimization (PPO).</p> +Posts on Eric X. Liu's Personal Page/posts/Recent content in Posts on Eric X. Liu's Personal PageHugoenSun, 03 Aug 2025 04:20:20 +0000A Deep Dive into PPO for Language Models/posts/a-deep-dive-into-ppo-for-language-models/Sat, 02 Aug 2025 00:00:00 +0000/posts/a-deep-dive-into-ppo-for-language-models/<p>Large Language Models (LLMs) have demonstrated astonishing capabilities, but out-of-the-box, they are simply powerful text predictors. They don&rsquo;t inherently understand what makes a response helpful, harmless, or aligned with human values. The technique that has proven most effective at bridging this gap is Reinforcement Learning from Human Feedback (RLHF), and at its heart lies a powerful algorithm: Proximal Policy Optimization (PPO).</p> <p>You may have seen diagrams like the one below, which outlines the RLHF training process. It can look intimidating, with a web of interconnected models, losses, and data flows.</p>Mixture-of-Experts (MoE) Models Challenges & Solutions in Practice/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/Wed, 02 Jul 2025 00:00:00 +0000/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/<p>Mixture-of-Experts (MoEs) are neural network architectures that allow different parts of the model (called &ldquo;experts&rdquo;) to specialize in different types of inputs. A &ldquo;gating network&rdquo; or &ldquo;router&rdquo; learns to dispatch each input (or &ldquo;token&rdquo;) to a subset of these experts. While powerful for scaling models, MoEs introduce several practical challenges.</p> <h3 id="1-challenge-non-differentiability-of-routing-functions"> 1. Challenge: Non-Differentiability of Routing Functions @@ -8,8 +8,9 @@ </a> </h3> <p><strong>The Problem:</strong> -Many routing mechanisms, especially &ldquo;Top-K routing,&rdquo; involve a discrete, hard selection process. A common function is <code>KeepTopK(v, k)</code>, which selects the top <code>k</code> scoring elements from a vector <code>v</code> and sets others to $-\infty$ or $0$.</p>An Architectural Deep Dive of T5/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/Sun, 01 Jun 2025 00:00:00 +0000/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/<p>In the rapidly evolving landscape of Large Language Models, a few key architectures define the dominant paradigms. 
Today, the &ldquo;decoder-only&rdquo; model, popularized by the GPT series and its successors like LLaMA and Mistral, reigns supreme. These models are scaled to incredible sizes and excel at in-context learning.</p> -<p>But to truly understand the field, we must look at the pivotal models that explored different paths. Google&rsquo;s T5, or <strong>Text-to-Text Transfer Transformer</strong>, stands out as one of the most influential. It didn&rsquo;t just introduce a new model; it proposed a new philosophy. This article dives deep into the architecture of T5, how it fundamentally differs from modern LLMs, and the lasting legacy of its unique design choices.</p>Some useful files/posts/useful/Mon, 26 Oct 2020 04:14:43 +0000/posts/useful/<ul> +Many routing mechanisms, especially &ldquo;Top-K routing,&rdquo; involve a discrete, hard selection process. A common function is <code>KeepTopK(v, k)</code>, which selects the top <code>k</code> scoring elements from a vector <code>v</code> and sets others to $-\infty$ or $0$.</p>An Architectural Deep Dive of T5/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/Sun, 01 Jun 2025 00:00:00 +0000/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/<p>In the rapidly evolving landscape of Large Language Models, a few key architectures define the dominant paradigms. Today, the &ldquo;decoder-only&rdquo; model, popularized by the GPT series and its successors like LLaMA and Mistral, reigns supreme. These models are scaled to incredible sizes and excel at in-context learning.</p> <p>But to truly understand the field, we must look at the pivotal models that explored different paths. Google&rsquo;s T5, or <strong>Text-to-Text Transfer Transformer</strong>, stands out as one of the most influential. It didn&rsquo;t just introduce a new model; it proposed a new philosophy. This article dives deep into the architecture of T5, how it fundamentally differs from modern LLMs, and the lasting legacy of its unique design choices.</p>Mastering Your Breville Barista Pro: The Ultimate Guide to Dialing In Espresso/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/Thu, 01 May 2025 00:00:00 +0000/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/<p>Are you ready to transform your home espresso game from good to genuinely great? The Breville Barista Pro is a fantastic machine, but unlocking its full potential requires understanding a few key principles. This guide will walk you through the systematic process of dialing in your espresso, ensuring every shot is delicious and repeatable.</p> +<p>Our overarching philosophy is simple: <strong>isolate and change only one variable at a time.</strong> While numbers are crucial, your palate is the ultimate judge.
Dose, ratio, and time are interconnected, but your <strong>grind size</strong> is your most powerful lever.</p>Some useful files/posts/useful/Mon, 26 Oct 2020 04:14:43 +0000/posts/useful/<ul> <li><a href="https://ericxliu.me/rootCA.pem" class="external-link" target="_blank" rel="noopener">rootCA.pem</a></li> <li><a href="https://ericxliu.me/vpnclient.ovpn" class="external-link" target="_blank" rel="noopener">vpnclient.ovpn</a></li> </ul> \ No newline at end of file diff --git a/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html b/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html index 7d9c9ad..3917387 100644 --- a/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html +++ b/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html @@ -7,9 +7,9 @@ The Problem: -Many routing mechanisms, especially “Top-K routing,” involve a discrete, hard selection process. A common function is KeepTopK(v, k), which selects the top k scoring elements from a vector v and sets others to $-\infty$ or $0$.">

2. Challenge: Uneven Expert Utilization (Balancing Loss)

The Problem: Left unchecked, the gating network might learn to heavily favor a few experts, leaving others underutilized. This leads to:

  • System Inefficiency: Overloaded experts become bottlenecks, while underutilized experts waste computational resources.
  • Suboptimal Learning: Experts might not specialize effectively if they don’t receive diverse data.

Solution: Heuristic Balancing Losses (e.g., from Switch Transformer, Fedus et al. 2022) -An auxiliary loss is added to the total model loss during training to encourage more even expert usage.

$$ \text{loss}_{\text{auxiliary}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i $$

Where:

  • $\alpha$: A hyperparameter controlling the strength of the auxiliary loss.
  • $N$: Total number of experts.
  • $f_i$: The fraction of tokens actually dispatched to expert $i$ in the current batch $B$.
-$$ f_i = \frac{1}{T} \sum_{x \in B} \mathbf{1}_{\text{argmax } p(x) = i} $$
-($p(x)$ here refers to the output of the gating network, which could be $s_{i,t}$ in the DeepSeek/classic router. The $\text{argmax}$ means it counts hard assignments to expert $i$.)
  • $P_i$: The fraction of the router probability mass allocated to expert $i$ in the current batch $B$. -$$ P_i = \frac{1}{T} \sum_{x \in B} p_i(x) $$ -($p_i(x)$ is the learned probability (or soft score) from the gating network for token $x$ and expert $i$.)

How it works: -The loss penalizes the sum of the products $f_i \cdot P_i$. Because an overused expert $i$ has both a high dispatch fraction $f_i$ and a high probability mass $P_i$, its term dominates the sum; the loss is minimized when routing is uniform, i.e., each $f_i$ and $P_i$ is close to $1/N$. The derivative with respect to $p_i(x)$ confirms that “more frequent use = stronger downweighting”: the gating network is penalized for sending more traffic to an already busy expert.

Relationship to Gating Network:

  • $p_i(x)$ (or $s_{i,t}$): This is the output of the learned gating network (e.g., from a linear layer followed by Softmax). The gating network’s parameters are updated via gradient descent, influenced by this auxiliary loss.
  • $P_i$: This is calculated from the outputs of the learned gating network for the current batch. It’s not a pre-defined value.

Limitation (“Second Best” Scenario): -Even with this loss, an expert can remain imbalanced if it’s consistently the “second best” option (high $P_i$) but never the absolute top choice that gets counted in $f_i$ (especially if $K=1$). This is because $f_i$ strictly counts hard assignments based on argmax. This limitation highlights why “soft” routing or “softmax after TopK” approaches can be more effective for truly even distribution.

3. Challenge: Overfitting during Fine-tuning +An auxiliary loss is added to the total model loss during training to encourage more even expert usage.

$$ \text{loss}_{\text{auxiliary}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i $$

Where:

  • $\alpha$: A hyperparameter controlling the strength of the auxiliary loss.
  • $N$: Total number of experts.
  • $f_i$: The fraction of tokens actually dispatched to expert $i$ in the current batch $B$.
+$$ f_i = \frac{1}{T} \sum_{x \in B} \mathbf{1}_{\text{argmax } p(x) = i} $$
+($p(x)$ here refers to the output of the gating network, which could be $s_{i,t}$ in the DeepSeek/classic router. The $\text{argmax}$ means it counts hard assignments to expert $i$.)
  • $P_i$: The fraction of the router probability mass allocated to expert $i$ in the current batch $B$.
+$$ P_i = \frac{1}{T} \sum_{x \in B} p_i(x) $$
+($p_i(x)$ is the learned probability (or soft score) from the gating network for token $x$ and expert $i$.)

How it works: +The loss penalizes the sum of the products $f_i \cdot P_i$. Because an overused expert $i$ has both a high dispatch fraction $f_i$ and a high probability mass $P_i$, its term dominates the sum; the loss is minimized when routing is uniform, i.e., each $f_i$ and $P_i$ is close to $1/N$. The derivative with respect to $p_i(x)$ confirms that “more frequent use = stronger downweighting”: the gating network is penalized for sending more traffic to an already busy expert.

Relationship to Gating Network:

  • $p_i(x)$ (or $s_{i,t}$): This is the output of the learned gating network (e.g., from a linear layer followed by Softmax). The gating network’s parameters are updated via gradient descent, influenced by this auxiliary loss.
  • $P_i$: This is calculated from the outputs of the learned gating network for the current batch. It’s not a pre-defined value.

Limitation (“Second Best” Scenario): +Even with this loss, an expert can remain imbalanced if it’s consistently the “second best” option (high (P_i)) but never the absolute top choice that gets counted in (f_i) (especially if (K=1)). This is because (f_i) strictly counts hard assignments based on argmax. This limitation highlights why “soft” routing or “softmax after TopK” approaches can be more effective for truly even distribution.

3. Challenge: Overfitting during Fine-tuning

The Problem: Sparse MoE models, despite only activating a few experts per token, possess a very large total number of parameters. When fine-tuning these models on smaller datasets, they are highly prone to overfitting. The model’s vast capacity allows it to memorize the limited fine-tuning data, leading to poor generalization performance on unseen validation data. This is evident when training loss continues to decrease, but validation loss stagnates or increases.

Solutions:

  • Zoph et al. Solution – Fine-tune non-MoE MLPs:

    • This strategy involves freezing a portion of the MoE model’s parameters during fine-tuning, specifically the large expert weights.
    • Instead, only the “non-MoE” parameters (e.g., attention layers, adapter layers, or the gating network itself) are updated.
    • This reduces the effective number of trainable parameters during fine-tuning, thereby mitigating the risk of overfitting on small datasets. It assumes the experts are already well-pre-trained for general tasks.
  • DeepSeek Solution – Use Lots of Data (1.4M SFT):

    • This approach tackles the problem by providing the model with a very large and diverse dataset for Supervised Fine-Tuning (SFT).
    • With abundant data (e.g., 1.4 million examples covering a wide range of tasks and languages), the model’s large capacity can be effectively utilized for specialized learning rather than memorization. The diversity and volume of data prevent individual experts from overfitting to specific examples.

Conclusion: @@ -44,4 +44,4 @@ The Top-K routing mechanism, as illustrated in the provided ima 2016 - 2025 Eric X. Liu -[f90b459] \ No newline at end of file +[fd19c59] \ No newline at end of file diff --git a/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html b/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html index e33b66a..52f0dd5 100644 --- a/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html +++ b/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html @@ -30,4 +30,4 @@ But to truly understand the field, we must look at the pivotal models that explo 2016 - 2025 Eric X. Liu -[f90b459] \ No newline at end of file +[fd19c59] \ No newline at end of file diff --git a/posts/useful/index.html b/posts/useful/index.html index f04864e..2e98a9c 100644 --- a/posts/useful/index.html +++ b/posts/useful/index.html @@ -10,4 +10,4 @@ One-minute read

  • [f90b459] \ No newline at end of file +[fd19c59] \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index 60dfc1b..97059f8 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -1 +1 @@ -/posts/a-deep-dive-into-ppo-for-language-models/2025-08-03T03:28:39+00:00weekly0.5/2025-08-03T03:41:10+00:00weekly0.5/posts/2025-08-03T03:41:10+00:00weekly0.5/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/2025-08-03T03:28:39+00:00weekly0.5/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/2025-08-03T03:41:10+00:00weekly0.5/posts/useful/2020-10-26T04:47:36+00:00weekly0.5/about/2020-06-16T23:30:17-07:00weekly0.5/categories/weekly0.5/tags/weekly0.5 \ No newline at end of file +/posts/a-deep-dive-into-ppo-for-language-models/2025-08-03T03:28:39+00:00weekly0.5/2025-08-03T04:20:20+00:00weekly0.5/posts/2025-08-03T04:20:20+00:00weekly0.5/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/2025-08-03T03:49:59+00:00weekly0.5/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/2025-08-03T03:41:10+00:00weekly0.5/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/2025-08-03T04:20:20+00:00weekly0.5/posts/useful/2020-10-26T04:47:36+00:00weekly0.5/about/2020-06-16T23:30:17-07:00weekly0.5/categories/weekly0.5/tags/weekly0.5 \ No newline at end of file diff --git a/tags/index.html b/tags/index.html index eadebbc..b4f7021 100644 --- a/tags/index.html +++ b/tags/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[f90b459] \ No newline at end of file +[fd19c59] \ No newline at end of file