# An Architectural Deep Dive of T5
*June 1, 2025 · 6-minute read*
In the rapidly evolving landscape of Large Language Models, a few key architectures define the dominant paradigms. Today, the “decoder-only” model, popularized by the GPT series and its successors like LLaMA and Mistral, reigns supreme. These models are scaled to incredible sizes and excel at in-context learning.

But to truly understand the field, we must look at the pivotal models that explored different paths. Google’s T5, or **Text-to-Text Transfer Transformer**, stands out as one of the most influential. It didn’t just introduce a new model; it proposed a new philosophy. This article dives deep into the architecture of T5, how it fundamentally differs from modern LLMs, and the lasting legacy of its unique design choices.

### The Core Philosophy: Everything is a Text-to-Text Problem
The genius of T5 lies in its unifying framework. Instead of building different models or fine-tuning procedures for various NLP tasks, T5 reframes every task as a text-to-text problem. The model takes a string as input and generates a string as output, regardless of the underlying objective.

This is accomplished by adding a **task prefix** to the input. These prefixes are not conversational prompts like a GPT “system prompt”; they are learned triggers that the model is explicitly fine-tuned to recognize.

| Task | T5 Input | Expected T5 Output |
| :--- | :--- | :--- |
| Translation | `translate English to German: The cat is cute.` | `Die Katze ist süß.` |
| Summarization | `summarize: [A long news article...]` | `[A concise summary.]` |
| Classification | `cola sentence: The boys is walking.` | `unacceptable` |
| Similarity | `stsb sentence1: The car is red. sentence2: The auto is crimson.` | `4.8` |

This elegant approach turns even classification into a generation task, where the model learns to generate the text of the correct label.
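To make the framing concrete, here is a minimal sketch of running two prefixed tasks through a public checkpoint; it assumes the Hugging Face `transformers` library and the `t5-small` weights, neither of which is discussed in the text above.

```python
# Minimal sketch: the task prefix is just part of the input string.
# Assumes the Hugging Face `transformers` library and the public `t5-small` checkpoint.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Same model, different prefixes -> different tasks.
for prompt in [
    "translate English to German: The cat is cute.",
    "cola sentence: The boys is walking.",
]:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```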
### The Engine: A Two-Window Encoder-Decoder Architecture

To execute this text-to-text mission, T5 uses the original Transformer’s **encoder-decoder architecture**. This is the most significant point of divergence from modern decoder-only LLMs. The inference process works in two distinct stages:
#### Stage 1: The Encoder (The “Understanding” Window)

When T5 receives an input like `summarize: [article text]`, the entire string is fed into the **encoder**.

- **Bidirectional Context:** The encoder processes the input bidirectionally. Every token can see every other token in the input text simultaneously. This allows the model to build a deep, holistic understanding of the entire prompt and its context.
- **Static Representation:** The encoder’s final output is not text. It’s a set of numerical representations (hidden states) that encapsulates the meaning and intent of the input. This representation is generated once and remains static for the entire generation process.
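As a sketch of this stage (again assuming the Hugging Face `transformers` library and the `t5-small` checkpoint), the encoder can be run on its own, producing the hidden states that are computed once and then reused at every decoding step:

```python
# Sketch: one bidirectional encoder pass produces the static representation.
# Assumes the Hugging Face `transformers` library and the `t5-small` checkpoint.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

enc = tokenizer("summarize: T5 reframes every NLP task as text-to-text.", return_tensors="pt")
with torch.no_grad():
    encoder_outputs = model.encoder(input_ids=enc["input_ids"],
                                    attention_mask=enc["attention_mask"])

# Shape (batch, input_length, d_model) — fixed for the whole generation process.
print(encoder_outputs.last_hidden_state.shape)
```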
#### Stage 2: The Decoder (The “Writing” Window)

The decoder is responsible for generating the output string token by token.

- **Autoregressive Generation:** It begins with a `start-of-sequence` token and generates the output one token at a time.
- **Cross-Attention:** At each step, the decoder does two things: it looks at the text it has generated so far (its own “decoder context”), and crucially, it uses a mechanism called **cross-attention** to look back at the static representation created by the encoder. This allows the decoder’s generation to be guided by the encoder’s complete understanding of the prompt.
- **Growing Context:** The decoder’s context window grows with each token it generates until it produces an `end-of-sequence` token, signaling that the task is complete.

This two-window system is a powerful design, especially for tasks that require a full understanding of a source document before generating a new one (like translation or summarization).
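A rough greedy-decoding loop, continuing the encoder sketch above and purely illustrative, shows the two windows side by side: the decoder context grows step by step while the encoder output stays fixed.

```python
# Sketch: greedy decoding that reuses the fixed encoder output at every step.
# Continues the encoder snippet above; T5 uses the pad token as its decoder start token.
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])

for _ in range(40):
    with torch.no_grad():
        out = model(encoder_outputs=encoder_outputs,          # static "understanding" window
                    decoder_input_ids=decoder_ids,            # growing "writing" window
                    attention_mask=enc["attention_mask"])
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
    if next_id.item() == model.config.eos_token_id:           # end-of-sequence -> done
        break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```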
### Architectural Divergence: T5 vs. The Modern LLM Playbook

Beyond its core architecture, T5 made several specific design choices that contrast with today’s standards.
#### 1. Positional Embeddings: Relative (RPE) vs. Rotary (RoPE)

How a model knows the order of words is critical.

- **T5’s Approach (RPE):** T5 uses a form of **Relative Positional Embedding**. Instead of adding a position signal to the word embeddings, it adds a learned bias directly to the attention scores based on the relative distance between tokens. It’s a clever way to encode position that is independent of sequence length.
- **The Modern Standard (RoPE):** Most modern LLMs (LLaMA, PaLM, Mistral) use **Rotary Positional Embeddings**. As detailed in the CS336 slides, RoPE works by mathematically *rotating* the Query and Key vectors based on their absolute position. This method has proven exceptionally effective for long sequences and is considered the current state-of-the-art.
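A simplified sketch of the RPE idea follows. It omits T5's log-spaced distance bucketing and keeps the usual attention scaling, so it illustrates the "learned bias added to attention scores" mechanism rather than T5's exact implementation.

```python
# Illustrative sketch of a relative position bias in the style of T5 (bucketing omitted).
# One learned scalar per (clipped relative distance, attention head), added to the logits.
import torch
import torch.nn as nn

num_heads, max_distance = 8, 128
rel_bias = nn.Embedding(2 * max_distance + 1, num_heads)

def scores_with_relative_bias(q, k):
    """q, k: (batch, heads, seq_len, head_dim) -> attention logits with position bias."""
    seq_len = q.size(-2)
    pos = torch.arange(seq_len)
    # Relative distance j - i, clipped to [-max_distance, max_distance], shifted to be >= 0.
    rel = (pos[None, :] - pos[:, None]).clamp(-max_distance, max_distance) + max_distance
    bias = rel_bias(rel).permute(2, 0, 1)                # (heads, seq_len, seq_len)
    logits = q @ k.transpose(-1, -2) / q.size(-1) ** 0.5
    return logits + bias                                 # broadcasts over the batch dimension
```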
#### 2. The Feed-Forward Network: An Extreme Experiment

The Feed-Forward Network (FFN) inside each Transformer block typically has an inner dimension of 4 times the model’s hidden dimension (`d_model`). The original T5 11B model took a radical departure from this rule.

- **T5 11B’s Choice:** It used a small hidden dimension (`d_model = 1024`) but an astoundingly large FFN dimension (`d_ff = 65,536`), a **64-times multiplier**. The rationale was that modern accelerators (like Google’s TPUs) are highly efficient at large, dense matrix multiplications.
- **The Modern Standard:** This experiment was not widely adopted. Later models, including T5’s own successor **T5 v1.1**, reverted to the standard 4x multiplier (or ~2.66x when using GLU activations) for a better balance of parameters and performance.
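A quick back-of-the-envelope comparison makes the scale of that choice visible; this counts only the two weight matrices of a ReLU-style FFN block and ignores biases.

```python
# Rough parameter count of one ReLU FFN block: two weight matrices, biases ignored.
d_model = 1024

def ffn_params(d_ff: int) -> int:
    return 2 * d_model * d_ff  # W_in (d_model x d_ff) + W_out (d_ff x d_model)

print(f"T5 11B (d_ff = 65,536):    {ffn_params(65_536):,}")  # 134,217,728 params per block
print(f"standard 4x (d_ff = 4,096): {ffn_params(4_096):,}")  #   8,388,608 params per block
```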
#### 3. Denoising: Span Corruption vs. Iterative Diffusion

While T5’s pre-training is called “denoising,” it’s conceptually different from the denoising in modern diffusion models.

- **T5’s Denoising:** This is **span corruption**. The model is shown a sentence with chunks of text masked out and learns to predict exactly what was removed in a single step. It’s a fill-in-the-blanks task to learn rich language representations.
- **Diffusion Denoising:** This is a multi-step generative process. A clean text is gradually corrupted with noise, and the model learns to reverse this process step-by-step, allowing it to generate high-fidelity text from pure noise.
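Written out as strings, a span-corruption pair in the style of the T5 paper (sentinels shown in Hugging Face's `<extra_id_N>` notation) makes the single-step, fill-in-the-blanks nature clear:

```python
# Span corruption, single step: mask spans in the input, predict only the dropped spans.
original = "Thank you for inviting me to your party last week ."

# Encoder input: each masked span is replaced by a unique sentinel token.
corrupted_input = "Thank you <extra_id_0> me to your party <extra_id_1> week ."

# Decoder target: only the dropped spans, each introduced by its sentinel.
target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
```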
### Where T5 Was Ahead of its Time

Despite its differences, the “T5 v1.1” variant pioneered several techniques that are now standard practice in the most advanced LLMs:

- **RMSNorm:** It was one of the first major models to adopt Root Mean Square Normalization instead of LayerNorm, a choice now used by LLaMA, Mistral, and others for its efficiency and stability.
- **Pre-Normalization:** T5 applies the normalization layer *before* the attention and FFN blocks, a critical technique for enabling stable training of very deep networks.
- **No Bias Terms:** T5 v1.1 removed the bias parameters from its normalization and FFN layers, a small but important optimization for memory and stability that modern models follow.
- **Gated Activations (GeGLU):** While the original T5 used ReLU, T5 v1.1 adopted a Gated Linear Unit (GeGLU), presaging the move to GLU-family activations (like SwiGLU) that is now ubiquitous.
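Two of those choices are compact enough to sketch directly. The module and attribute names below are illustrative rather than T5's exact implementation, but the shapes of the ideas are the same: no mean-centering or bias in the norm, and a gated, bias-free feed-forward block.

```python
# Sketches of two T5 v1.1 choices: RMSNorm and a GeGLU feed-forward block (no bias terms).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        # Rescale by the root mean square of the activations; no mean subtraction, no bias.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class GeGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)  # gated branch, passed through GELU
        self.up = nn.Linear(d_model, d_ff, bias=False)    # linear branch
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # Pre-normalization would be applied before this block in a full layer.
        return self.down(F.gelu(self.gate(x)) * self.up(x))
```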
### Conclusion: The Lasting Legacy

T5 represents a different evolutionary branch in the Transformer family tree. While the field has largely converged on the decoder-only architecture for its scalability in general-purpose models, T5’s design remains a masterclass in purpose-built engineering.

Its text-to-text framework was revolutionary, its encoder-decoder structure is still a go-to for tasks like translation, and its refined T5 v1.1 architecture laid the groundwork for many of the stability and efficiency tricks we see in today’s state-of-the-art models. T5 is more than just a model; it’s a crucial case study in the architectural trade-offs that continue to shape the future of artificial intelligence.