<labelclass="menu-button float-right"for=menu-toggle><iclass="fa-solid fa-bars fa-fw"aria-hidden=true></i></label><ulclass=navigation-list><liclass=navigation-item><aclass=navigation-linkhref=/posts/>Posts</a></li><liclass=navigation-item><aclass=navigation-linkhref=https://chat.ericxliu.me>Chat</a></li><liclass=navigation-item><aclass=navigation-linkhref=https://git.ericxliu.me/user/oauth2/Authenitk>Git</a></li><liclass=navigation-item><aclass=navigation-linkhref=https://coder.ericxliu.me/api/v2/users/oidc/callback>Coder</a></li><liclass=navigation-item><aclass=navigation-linkhref=https://rss.ericxliu.me/oauth2/oidc/redirect>RSS</a></li><liclass=navigation-item><aclass=navigation-linkhref=/>|</a></li><liclass=navigation-item><aclass=navigation-linkhref=https://sso.ericxliu.me>Sign in</a></li></ul></section></nav><divclass=content><sectionclass="container post"><article><header><divclass=post-title><h1class=title><aclass=title-linkhref=/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/>T5 - The Transformer That Zigged When Others Zagged - An Architectural Deep Dive</a></h1></div><divclass=post-meta><divclass=date><spanclass=posted-on><iclass="fa-solid fa-calendar"aria-hidden=true></i>
In the rapidly evolving landscape of Large Language Models, a few key architectures define the dominant paradigms. Today, the “decoder-only” model, popularized by the GPT series and its successors like LLaMA and Mistral, reigns supreme. These models are scaled to incredible sizes and excel at in-context learning.

But to truly understand the field, we must look at the pivotal models that explored different paths. Google’s T5, or **Text-to-Text Transfer Transformer**, stands out as one of the most influential. It didn’t just introduce a new model; it proposed a new philosophy. This article dives deep into the architecture of T5, how it fundamentally differs from modern LLMs, and the lasting legacy of its unique design choices.
### The Core Philosophy: Everything is a Text-to-Text Problem

The genius of T5 lies in its unifying framework. Instead of building different models or fine-tuning procedures for various NLP tasks, T5 reframes every task as a text-to-text problem. The model takes a string as input and generates a string as output, regardless of the underlying objective.

This is accomplished by adding a **task prefix** to the input. These prefixes are not conversational prompts like a GPT “system prompt”; they are learned triggers that the model is explicitly fine-tuned to recognize.

| Task | T5 Input | Expected T5 Output |
|------|----------|--------------------|
| Translation | `translate English to German: The cat is cute.` | `Die Katze ist süß.` |
| Summarization | `summarize: [A long news article...]` | `[A concise summary.]` |
| Classification | `cola sentence: The boys is walking.` | `unacceptable` |
| Similarity | `stsb sentence1: The car is red. sentence2: The auto is crimson.` | `4.8` |

This elegant approach turns even classification into a generation task, where the model learns to generate the text of the correct label.
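To make the single-interface idea concrete, here is a minimal sketch using the Hugging Face `transformers` library and the public `t5-small` checkpoint (both are assumptions of this example, not part of the original T5 release). Every task flows through the same tokenize-generate-decode path; only the prefix changes.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Different tasks, identical code path: only the task prefix differs.
for prompt in [
    "translate English to German: The cat is cute.",
    "cola sentence: The boys is walking.",
    "stsb sentence1: The car is red. sentence2: The auto is crimson.",
]:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Swapping in a larger checkpoint changes output quality, not the interface.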
### The Engine: A Two-Window Encoder-Decoder Architecture

To execute this text-to-text mission, T5 uses the original Transformer’s **encoder-decoder architecture**. This is the most significant point of divergence from modern decoder-only LLMs. The inference process works in two distinct stages:
#### Stage 1: The Encoder (The “Understanding” Window)

When T5 receives an input like `summarize: [article text]`, the entire string is fed into the **encoder**.

- **Bidirectional Context:** The encoder processes the input bidirectionally. Every token can see every other token in the input text simultaneously. This allows the model to build a deep, holistic understanding of the entire prompt and its context.
- **Static Representation:** The encoder’s final output is not text. It’s a set of numerical representations (hidden states) that encapsulates the meaning and intent of the input. This representation is generated once and remains static for the entire generation process (see the sketch below).
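A small sketch of the encoder stage, under the same `transformers`/`t5-small` assumptions as before: the encoder runs exactly once, and its hidden states are what the decoder will attend to later.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("summarize: The quick brown fox jumped over the lazy dog.",
                   return_tensors="pt")

# Run only the encoder to build the static representation of the prompt.
with torch.no_grad():
    encoder_outputs = model.get_encoder()(**inputs)

# One hidden state per input token; computed once and reused at every decoding step.
print(encoder_outputs.last_hidden_state.shape)  # (1, seq_len, 512) for t5-small
```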
#### Stage 2: The Decoder (The “Writing” Window)

The decoder is responsible for generating the output string token by token.

- **Autoregressive Generation:** It begins with a `start-of-sequence` token and generates the output one token at a time.
- **Cross-Attention:** At each step, the decoder does two things: it looks at the text it has generated so far (its own “decoder context”), and, crucially, it uses a mechanism called **cross-attention** to look back at the static representation created by the encoder. This allows the decoder’s generation to be guided by the encoder’s complete understanding of the prompt.
- **Growing Context:** The decoder’s context window grows with each token it generates until it produces an `end-of-sequence` token, signaling that the task is complete.

This two-window system is a powerful design, especially for tasks that require a full understanding of a source document before generating a new one (like translation or summarization). The greedy decoding loop sketched below makes the two windows explicit.
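Continuing the encoder sketch above (it reuses `model`, `tokenizer`, `inputs`, and `encoder_outputs` from that snippet), here is a hand-rolled greedy decoding loop. In practice `model.generate()` wraps this logic and adds caching, sampling, and beam search; this is only to show the frozen encoder representation being consulted at every step.

```python
# Start the decoder from T5's start token and grow the output one token at a time.
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

for _ in range(20):  # small cap on output length for the sketch
    out = model(encoder_outputs=encoder_outputs,        # static "understanding" window
                attention_mask=inputs["attention_mask"],
                decoder_input_ids=decoder_input_ids)    # growing "writing" window
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
    decoder_input_ids = torch.cat([decoder_input_ids, next_id], dim=-1)
    if next_id.item() == model.config.eos_token_id:     # stop at end-of-sequence
        break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
```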
### Architectural Divergence: T5 vs. The Modern LLM Playbook

Beyond its core architecture, T5 made several specific design choices that contrast with today’s standards.
#### 1. Positional Embeddings: Relative (RPE) vs. Rotary (RoPE)

How a model knows the order of words is critical.

- **T5’s Approach (RPE):** T5 uses a form of **Relative Positional Embedding**. Instead of adding a position signal to the word embeddings, it adds a learned bias directly to the attention scores based on the relative distance between tokens (see the sketch below). It’s a clever way to encode position that is independent of sequence length.
- **The Modern Standard (RoPE):** Most modern LLMs (LLaMA, PaLM, Mistral) use **Rotary Positional Embeddings**. As detailed in the CS336 slides, RoPE works by mathematically *rotating* the Query and Key vectors based on their absolute position. This method has proven exceptionally effective for long sequences and is considered the current state of the art.
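A simplified PyTorch sketch of the relative-bias idea. The real T5 implementation buckets long distances on a log scale; this version simply clips the distance, so treat it as illustrative rather than a faithful reimplementation.

```python
import torch

def relative_attention_bias(num_heads, q_len, k_len, num_buckets=32):
    # Map each (query, key) relative distance to a bucket, then look up a learned
    # per-head bias that is added to the attention logits before the softmax.
    rel = torch.arange(k_len)[None, :] - torch.arange(q_len)[:, None]   # j - i
    buckets = rel.clamp(-num_buckets // 2, num_buckets // 2 - 1) + num_buckets // 2
    table = torch.nn.Embedding(num_buckets, num_heads)                  # learned (random init here)
    return table(buckets).permute(2, 0, 1)                              # (heads, q_len, k_len)

# Usage inside attention (shapes only, sketch):
#   scores = q @ k.transpose(-1, -2)                       # (heads, q_len, k_len)
#   scores = scores + relative_attention_bias(heads, q_len, k_len)
#   weights = scores.softmax(dim=-1)
```

Because the bias depends only on the distance `j - i`, the same table applies no matter how long the sequence is.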
#### 2. The Feed-Forward Network: An Extreme Experiment

The feed-forward network (FFN) inside each Transformer block typically has an inner dimension 4 times the model’s hidden dimension (`d_model`). The original T5 11B model took a radical departure from this rule.

- **T5 11B’s Choice:** It used a small hidden dimension (`d_model = 1024`) but an astoundingly large FFN dimension (`d_ff = 65,536`), a **64-times multiplier**. The rationale was that modern accelerators (like Google’s TPUs) are highly efficient at large, dense matrix multiplications. The arithmetic below shows how lopsided this makes the parameter count.
- **The Modern Standard:** This experiment was not widely adopted. Later models, including T5’s own successor **T5 v1.1**, reverted to the standard 4x multiplier (or ~2.66x when using GLU activations) for a better balance of parameters and performance.
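A quick back-of-the-envelope calculation, assuming the plain two-matrix ReLU FFN of the original T5 and ignoring attention, embeddings, and biases:

```python
d_model = 1024
d_ff_t5_11b = 65_536          # the 64x multiplier used by the original T5 11B
d_ff_standard = 4 * d_model   # the conventional 4x multiplier

def ffn_params(d_ff):
    # W_in is (d_model x d_ff), W_out is (d_ff x d_model); no biases in T5's dense layers.
    return 2 * d_model * d_ff

print(f"T5 11B FFN per layer:      {ffn_params(d_ff_t5_11b):,}")    # 134,217,728 (~134M)
print(f"Standard 4x FFN per layer: {ffn_params(d_ff_standard):,}")  # 8,388,608 (~8.4M)
```

Almost all of T5 11B's parameters sit in these enormous FFN matrices, which is exactly the kind of dense matmul TPUs handle efficiently.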
#### 3. Denoising: Span Corruption vs. Iterative Diffusion

While T5’s pre-training is called “denoising,” it’s conceptually different from the denoising in modern diffusion models.

- **T5’s Denoising:** This is **span corruption**. The model is shown a sentence with chunks of text masked out and learns to predict exactly what was removed in a single step (see the sketch below). It’s a fill-in-the-blanks task designed to learn rich language representations.
- **Diffusion Denoising:** This is a multi-step generative process. A clean text is gradually corrupted with noise, and the model learns to reverse this process step by step, allowing it to generate high-fidelity text from pure noise.
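A toy sketch of the span-corruption format. The helper is illustrative (T5 samples span locations and lengths randomly during pre-training; here they are passed in by hand), but the input/target layout with sentinel tokens follows the idea:

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with a sentinel in the input; the target
    reconstructs only the dropped spans, each prefixed by its sentinel."""
    inputs, targets = [], []
    cursor, sentinel = 0, 0
    for start, length in spans:
        inputs.extend(tokens[cursor:start])
        inputs.append(f"<extra_id_{sentinel}>")
        targets.append(f"<extra_id_{sentinel}>")
        targets.extend(tokens[start:start + length])
        cursor, sentinel = start + length, sentinel + 1
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{sentinel}>")  # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, [(2, 2), (8, 1)])
print(inp)  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(tgt)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The model sees the corrupted input through the encoder and must emit the short target in one pass, rather than iteratively refining noise.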
### Where T5 Was Ahead of its Time

Despite its differences, the “T5 v1.1” variant pioneered several techniques that are now standard practice in the most advanced LLMs:

- **RMSNorm:** It was one of the first major models to adopt Root Mean Square Normalization instead of LayerNorm, a choice now used by LLaMA, Mistral, and others for its efficiency and stability (a minimal implementation is sketched below).
- **Pre-Normalization:** T5 applies the normalization layer *before* the attention and FFN blocks, a critical technique for enabling stable training of very deep networks.
- **No Bias Terms:** T5 v1.1 removed the bias parameters from its normalization and FFN layers, a small but important optimization for memory and stability that modern models follow.
- **Gated Activations (GeGLU):** While the original T5 used ReLU, T5 v1.1 adopted a Gated Linear Unit variant (GeGLU), presaging the move to GLU-family activations (like SwiGLU) that is now ubiquitous.
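For reference, RMSNorm fits in a few lines. This PyTorch sketch highlights the two properties the bullets call out: no bias and no mean subtraction, just a learned gain over the root-mean-square of the activations.

```python
import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))  # learned gain, no bias term
        self.eps = eps

    def forward(self, x):
        # Unlike LayerNorm, no mean is subtracted: scale by 1 / RMS(x) only.
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * inv_rms)
```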
### Conclusion: The Lasting Legacy