<labelclass="menu-button float-right"for=menu-toggle><iclass="fa-solid fa-bars fa-fw"aria-hidden=true></i></label><ulclass=navigation-list><liclass=navigation-item><aclass=navigation-linkhref=/posts/>Posts</a></li><liclass=navigation-item><aclass=navigation-linkhref=https://chat.ericxliu.me>Chat</a></li><liclass=navigation-item><aclass=navigation-linkhref=https://git.ericxliu.me/user/oauth2/Authenitk>Git</a></li><liclass=navigation-item><aclass=navigation-linkhref=https://coder.ericxliu.me/api/v2/users/oidc/callback>Coder</a></li><liclass=navigation-item><aclass=navigation-linkhref=https://rss.ericxliu.me/oauth2/oidc/redirect>RSS</a></li><liclass=navigation-item><aclass=navigation-linkhref=/>|</a></li><liclass=navigation-item><aclass=navigation-linkhref=https://sso.ericxliu.me>Sign in</a></li></ul></section></nav><divclass=content><sectionclass="container post"><article><header><divclass=post-title><h1class=title><aclass=title-linkhref=/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/>T5 - The Transformer That Zigged When Others Zagged - An Architectural Deep Dive</a></h1></div><divclass=post-meta><divclass=date><spanclass=posted-on><iclass="fa-solid fa-calendar"aria-hidden=true></i>
In the rapidly evolving landscape of Large Language Models, a few key architectures define the dominant paradigms. Today, the “decoder-only” model, popularized by the GPT series and its successors like LLaMA and Mistral, reigns supreme. These models are scaled to incredible sizes and excel at in-context learning.

But to truly understand the field, we must look at the pivotal models that explored different paths. Google’s T5, or **Text-to-Text Transfer Transformer**, stands out as one of the most influential. It didn’t just introduce a new model; it proposed a new philosophy. This article dives deep into the architecture of T5, how it fundamentally differs from modern LLMs, and the lasting legacy of its unique design choices.
### The Core Philosophy: Everything is a Text-to-Text Problem

The genius of T5 lies in its unifying framework. Instead of building different models or fine-tuning procedures for various NLP tasks, T5 reframes every task as a text-to-text problem. The model takes a string as input and generates a string as output, regardless of the underlying objective.

This is accomplished by adding a **task prefix** to the input. These prefixes are not conversational prompts like a GPT “system prompt”; they are learned triggers that the model is explicitly fine-tuned to recognize.

| Task | T5 Input | Expected T5 Output |
|------|----------|--------------------|
| Translation | `translate English to German: The cat is cute.` | `Die Katze ist süß.` |
| Summarization | `summarize: [A long news article...]` | `[A concise summary.]` |
| Classification | `cola sentence: The boys is walking.` | `unacceptable` |
| Similarity | `stsb sentence1: The car is red. sentence2: The auto is crimson.` | `4.8` |

This elegant approach turns even classification into a generation task, where the model learns to generate the text of the correct label.
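To make the single-interface idea concrete, here is a minimal sketch using the Hugging Face `transformers` library and the public `t5-small` checkpoint (both are assumptions of this example, not part of the original T5 release). Every task flows through the same tokenize-generate-decode path; only the prefix changes.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Different tasks, identical code path: only the task prefix differs.
for prompt in [
    "translate English to German: The cat is cute.",
    "cola sentence: The boys is walking.",
    "stsb sentence1: The car is red. sentence2: The auto is crimson.",
]:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Swapping in a larger checkpoint changes output quality, not the interface.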
### The Engine: A Two-Window Encoder-Decoder Architecture

To execute this text-to-text mission, T5 uses the original Transformer’s **encoder-decoder architecture**. This is the most significant point of divergence from modern decoder-only LLMs. The inference process works in two distinct stages:
#### Stage 1: The Encoder (The “Understanding” Window)

When T5 receives an input like `summarize: [article text]`, the entire string is fed into the **encoder**.

- **Bidirectional Context:** The encoder processes the input bidirectionally. Every token can see every other token in the input text simultaneously. This allows the model to build a deep, holistic understanding of the entire prompt and its context.
- **Static Representation:** The encoder’s final output is not text. It’s a set of numerical representations (hidden states) that encapsulates the meaning and intent of the input. This representation is generated once and remains static for the entire generation process (see the sketch below).
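A small sketch of the encoder stage, under the same `transformers`/`t5-small` assumptions as before: the encoder runs exactly once, and its hidden states are what the decoder will attend to later.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("summarize: The quick brown fox jumped over the lazy dog.",
                   return_tensors="pt")

# Run only the encoder to build the static representation of the prompt.
with torch.no_grad():
    encoder_outputs = model.get_encoder()(**inputs)

# One hidden state per input token; computed once and reused at every decoding step.
print(encoder_outputs.last_hidden_state.shape)  # (1, seq_len, 512) for t5-small
```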
#### Stage 2: The Decoder (The “Writing” Window)

The decoder is responsible for generating the output string token by token.

- **Autoregressive Generation:** It begins with a `start-of-sequence` token and generates the output one token at a time.
- **Cross-Attention:** At each step, the decoder does two things: it looks at the text it has generated so far (its own “decoder context”), and, crucially, it uses a mechanism called **cross-attention** to look back at the static representation created by the encoder. This allows the decoder’s generation to be guided by the encoder’s complete understanding of the prompt.
- **Growing Context:** The decoder’s context window grows with each token it generates until it produces an `end-of-sequence` token, signaling that the task is complete.

This two-window system is a powerful design, especially for tasks that require a full understanding of a source document before generating a new one (like translation or summarization). The greedy decoding loop sketched below makes the two windows explicit.
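Continuing the encoder sketch above (it reuses `model`, `tokenizer`, `inputs`, and `encoder_outputs` from that snippet), here is a hand-rolled greedy decoding loop. In practice `model.generate()` wraps this logic and adds caching, sampling, and beam search; this is only to show the frozen encoder representation being consulted at every step.

```python
# Start the decoder from T5's start token and grow the output one token at a time.
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

for _ in range(20):  # small cap on output length for the sketch
    out = model(encoder_outputs=encoder_outputs,        # static "understanding" window
                attention_mask=inputs["attention_mask"],
                decoder_input_ids=decoder_input_ids)    # growing "writing" window
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
    decoder_input_ids = torch.cat([decoder_input_ids, next_id], dim=-1)
    if next_id.item() == model.config.eos_token_id:     # stop at end-of-sequence
        break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
```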
### Architectural Divergence: T5 vs. The Modern LLM Playbook

Beyond its core architecture, T5 made several specific design choices that contrast with today’s standards.
#### 1. Positional Embeddings: Relative (RPE) vs. Rotary (RoPE)

How a model knows the order of words is critical.

- **T5’s Approach (RPE):** T5 uses a form of **Relative Positional Embedding**. Instead of adding a position signal to the word embeddings, it adds a learned bias directly to the attention scores based on the relative distance between tokens (see the sketch below). It’s a clever way to encode position that is independent of sequence length.
- **The Modern Standard (RoPE):** Most modern LLMs (LLaMA, PaLM, Mistral) use **Rotary Positional Embeddings**. As detailed in the CS336 slides, RoPE works by mathematically *rotating* the Query and Key vectors based on their absolute position. This method has proven exceptionally effective for long sequences and is considered the current state of the art.
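A simplified PyTorch sketch of the relative-bias idea. The real T5 implementation buckets long distances on a log scale; this version simply clips the distance, so treat it as illustrative rather than a faithful reimplementation.

```python
import torch

def relative_attention_bias(num_heads, q_len, k_len, num_buckets=32):
    # Map each (query, key) relative distance to a bucket, then look up a learned
    # per-head bias that is added to the attention logits before the softmax.
    rel = torch.arange(k_len)[None, :] - torch.arange(q_len)[:, None]   # j - i
    buckets = rel.clamp(-num_buckets // 2, num_buckets // 2 - 1) + num_buckets // 2
    table = torch.nn.Embedding(num_buckets, num_heads)                  # learned (random init here)
    return table(buckets).permute(2, 0, 1)                              # (heads, q_len, k_len)

# Usage inside attention (shapes only, sketch):
#   scores = q @ k.transpose(-1, -2)                       # (heads, q_len, k_len)
#   scores = scores + relative_attention_bias(heads, q_len, k_len)
#   weights = scores.softmax(dim=-1)
```

Because the bias depends only on the distance `j - i`, the same table applies no matter how long the sequence is.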
#### 2. The Feed-Forward Network: An Extreme Experiment

The feed-forward network (FFN) inside each Transformer block typically has an inner dimension 4 times the model’s hidden dimension (`d_model`). The original T5 11B model took a radical departure from this rule.

- **T5 11B’s Choice:** It used a small hidden dimension (`d_model = 1024`) but an astoundingly large FFN dimension (`d_ff = 65,536`), a **64-times multiplier**. The rationale was that modern accelerators (like Google’s TPUs) are highly efficient at large, dense matrix multiplications. The arithmetic below shows how lopsided this makes the parameter count.
- **The Modern Standard:** This experiment was not widely adopted. Later models, including T5’s own successor **T5 v1.1**, reverted to the standard 4x multiplier (or ~2.66x when using GLU activations) for a better balance of parameters and performance.
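A quick back-of-the-envelope calculation, assuming the plain two-matrix ReLU FFN of the original T5 and ignoring attention, embeddings, and biases:

```python
d_model = 1024
d_ff_t5_11b = 65_536          # the 64x multiplier used by the original T5 11B
d_ff_standard = 4 * d_model   # the conventional 4x multiplier

def ffn_params(d_ff):
    # W_in is (d_model x d_ff), W_out is (d_ff x d_model); no biases in T5's dense layers.
    return 2 * d_model * d_ff

print(f"T5 11B FFN per layer:      {ffn_params(d_ff_t5_11b):,}")    # 134,217,728 (~134M)
print(f"Standard 4x FFN per layer: {ffn_params(d_ff_standard):,}")  # 8,388,608 (~8.4M)
```

Almost all of T5 11B's parameters sit in these enormous FFN matrices, which is exactly the kind of dense matmul TPUs handle efficiently.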
#### 3. Denoising: Span Corruption vs. Iterative Diffusion

While T5’s pre-training is called “denoising,” it’s conceptually different from the denoising in modern diffusion models.

- **T5’s Denoising:** This is **span corruption**. The model is shown a sentence with chunks of text masked out and learns to predict exactly what was removed in a single step (see the sketch below). It’s a fill-in-the-blanks task designed to learn rich language representations.
- **Diffusion Denoising:** This is a multi-step generative process. A clean text is gradually corrupted with noise, and the model learns to reverse this process step by step, allowing it to generate high-fidelity text from pure noise.
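A toy sketch of the span-corruption format. The helper is illustrative (T5 samples span locations and lengths randomly during pre-training; here they are passed in by hand), but the input/target layout with sentinel tokens follows the idea:

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with a sentinel in the input; the target
    reconstructs only the dropped spans, each prefixed by its sentinel."""
    inputs, targets = [], []
    cursor, sentinel = 0, 0
    for start, length in spans:
        inputs.extend(tokens[cursor:start])
        inputs.append(f"<extra_id_{sentinel}>")
        targets.append(f"<extra_id_{sentinel}>")
        targets.extend(tokens[start:start + length])
        cursor, sentinel = start + length, sentinel + 1
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{sentinel}>")  # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, [(2, 2), (8, 1)])
print(inp)  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(tgt)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The model sees the corrupted input through the encoder and must emit the short target in one pass, rather than iteratively refining noise.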
### Where T5 Was Ahead of its Time

Despite its differences, the “T5 v1.1” variant pioneered several techniques that are now standard practice in the most advanced LLMs:

- **RMSNorm:** It was one of the first major models to adopt Root Mean Square Normalization instead of LayerNorm, a choice now used by LLaMA, Mistral, and others for its efficiency and stability (a minimal implementation is sketched below).
- **Pre-Normalization:** T5 applies the normalization layer *before* the attention and FFN blocks, a critical technique for enabling stable training of very deep networks.
- **No Bias Terms:** T5 v1.1 removed the bias parameters from its normalization and FFN layers, a small but important optimization for memory and stability that modern models follow.
- **Gated Activations (GeGLU):** While the original T5 used ReLU, T5 v1.1 adopted a Gated Linear Unit variant (GeGLU), presaging the move to GLU-family activations (like SwiGLU) that is now ubiquitous.
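For reference, RMSNorm fits in a few lines. This PyTorch sketch highlights the two properties the bullets call out: no bias and no mean subtraction, just a learned gain over the root-mean-square of the activations.

```python
import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))  # learned gain, no bias term
        self.eps = eps

    def forward(self, x):
        # Unlike LayerNorm, no mean is subtracted: scale by 1 / RMS(x) only.
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * inv_rms)
```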
### Conclusion: The Lasting Legacy