From 8f3c545991a9e6f23324d0343b73ab28c2855662 Mon Sep 17 00:00:00 2001 From: eric Date: Sun, 3 Aug 2025 02:20:56 +0000 Subject: [PATCH] deploy: b6192ca3ca4bd7e37585537778b12154446514c6 --- 404.html | 2 +- about/index.html | 2 +- categories/index.html | 2 +- index.html | 2 +- index.xml | 3 +- .../index.html | 2 +- posts/index.html | 5 +-- posts/index.xml | 3 +- .../index.html | 33 +++++++++++++++++++ posts/useful/index.html | 2 +- sitemap.xml | 2 +- tags/index.html | 2 +- 12 files changed, 48 insertions(+), 12 deletions(-) create mode 100644 posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html diff --git a/404.html b/404.html index 8d5c61c..341206c 100644 --- a/404.html +++ b/404.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[a3ccac4] \ No newline at end of file +[b6192ca] \ No newline at end of file diff --git a/about/index.html b/about/index.html index b15de16..bd55d86 100644 --- a/about/index.html +++ b/about/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[a3ccac4] \ No newline at end of file +[b6192ca] \ No newline at end of file diff --git a/categories/index.html b/categories/index.html index d2fa541..115f7a8 100644 --- a/categories/index.html +++ b/categories/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[a3ccac4] \ No newline at end of file +[b6192ca] \ No newline at end of file diff --git a/index.html b/index.html index 024057b..4a6c088 100644 --- a/index.html +++ b/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[a3ccac4] \ No newline at end of file +[b6192ca] \ No newline at end of file diff --git a/index.xml b/index.xml index a62945b..bddae8d 100644 --- a/index.xml +++ b/index.xml @@ -1,4 +1,5 @@ -Eric X. Liu's Personal Page/Recent content on Eric X. Liu's Personal PageHugoenSat, 02 Aug 2025 15:46:24 -0700Some useful files/posts/useful/Mon, 26 Oct 2020 04:14:43 +0000/posts/useful/<ul> +Eric X. Liu's Personal Page/Recent content on Eric X. Liu's Personal PageHugoenSun, 03 Aug 2025 01:47:39 +0000T5 - The Transformer That Zigged When Others Zagged - An Architectural Deep Dive/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/Sun, 03 Aug 2025 01:47:10 +0000/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/<p>In the rapidly evolving landscape of Large Language Models, a few key architectures define the dominant paradigms. Today, the &ldquo;decoder-only&rdquo; model, popularized by the GPT series and its successors like LLaMA and Mistral, reigns supreme. These models are scaled to incredible sizes and excel at in-context learning.</p> +<p>But to truly understand the field, we must look at the pivotal models that explored different paths. Google&rsquo;s T5, or <strong>Text-to-Text Transfer Transformer</strong>, stands out as one of the most influential. It didn&rsquo;t just introduce a new model; it proposed a new philosophy. 
This article dives deep into the architecture of T5, how it fundamentally differs from modern LLMs, and the lasting legacy of its unique design choices.</p>Some useful files/posts/useful/Mon, 26 Oct 2020 04:14:43 +0000/posts/useful/<ul> <li><a href="https://ericxliu.me/rootCA.pem" class="external-link" target="_blank" rel="noopener">rootCA.pem</a></li> <li><a href="https://ericxliu.me/vpnclient.ovpn" class="external-link" target="_blank" rel="noopener">vpnclient.ovpn</a></li> </ul>About/about/Fri, 01 Jun 2018 07:13:52 +0000/about/<link>/posts/a-deep-dive-into-ppo-for-language-models/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/posts/a-deep-dive-into-ppo-for-language-models/</guid><description><p>Large Language Models (LLMs) have demonstrated astonishing capabilities, but out-of-the-box, they are simply powerful text predictors. They don&rsquo;t inherently understand what makes a response helpful, harmless, or aligned with human values. The technique that has proven most effective at bridging this gap is Reinforcement Learning from Human Feedback (RLHF), and at its heart lies a powerful algorithm: Proximal Policy Optimization (PPO).</p> diff --git a/posts/a-deep-dive-into-ppo-for-language-models/index.html b/posts/a-deep-dive-into-ppo-for-language-models/index.html index 47be9c2..63e7a9b 100644 --- a/posts/a-deep-dive-into-ppo-for-language-models/index.html +++ b/posts/a-deep-dive-into-ppo-for-language-models/index.html @@ -23,4 +23,4 @@ where <code>δ_t = r_t + γV(s_{t+1}) - V(s_t)</code></p><ul><li><strong>γ (gam 2016 - 2025 Eric X. Liu -<a href="https://git.ericxliu.me/eric/ericxliu-me/commit/a3ccac4">[a3ccac4]</a></section></footer></main><script src=/js/coder.min.6ae284be93d2d19dad1f02b0039508d9aab3180a12a06dcc71b0b0ef7825a317.js integrity="sha256-auKEvpPS0Z2tHwKwA5UI2aqzGAoSoG3McbCw73gloxc="></script></body></html> \ No newline at end of file +<a href="https://git.ericxliu.me/eric/ericxliu-me/commit/b6192ca">[b6192ca]</a></section></footer></main><script src=/js/coder.min.6ae284be93d2d19dad1f02b0039508d9aab3180a12a06dcc71b0b0ef7825a317.js integrity="sha256-auKEvpPS0Z2tHwKwA5UI2aqzGAoSoG3McbCw73gloxc="></script></body></html> \ No newline at end of file diff --git a/posts/index.html b/posts/index.html index 826ada5..9990fa3 100644 --- a/posts/index.html +++ b/posts/index.html @@ -1,9 +1,10 @@ <!doctype html><html lang=en><head><title>Posts · Eric X. Liu's Personal Page
\ No newline at end of file +[b6192ca] \ No newline at end of file diff --git a/posts/index.xml b/posts/index.xml index d7fc227..1c639b3 100644 --- a/posts/index.xml +++ b/posts/index.xml @@ -1,4 +1,5 @@ -Posts on Eric X. Liu's Personal Page/posts/Recent content in Posts on Eric X. Liu's Personal PageHugoenSat, 02 Aug 2025 15:46:24 -0700Some useful files/posts/useful/Mon, 26 Oct 2020 04:14:43 +0000/posts/useful/<ul> +Posts on Eric X. Liu's Personal Page/posts/Recent content in Posts on Eric X. Liu's Personal PageHugoenSun, 03 Aug 2025 01:47:39 +0000T5 - The Transformer That Zigged When Others Zagged - An Architectural Deep Dive/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/Sun, 03 Aug 2025 01:47:10 +0000/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/<p>In the rapidly evolving landscape of Large Language Models, a few key architectures define the dominant paradigms. Today, the &ldquo;decoder-only&rdquo; model, popularized by the GPT series and its successors like LLaMA and Mistral, reigns supreme. These models are scaled to incredible sizes and excel at in-context learning.</p> +<p>But to truly understand the field, we must look at the pivotal models that explored different paths. Google&rsquo;s T5, or <strong>Text-to-Text Transfer Transformer</strong>, stands out as one of the most influential. It didn&rsquo;t just introduce a new model; it proposed a new philosophy. This article dives deep into the architecture of T5, how it fundamentally differs from modern LLMs, and the lasting legacy of its unique design choices.</p>Some useful files/posts/useful/Mon, 26 Oct 2020 04:14:43 +0000/posts/useful/<ul> <li><a href="https://ericxliu.me/rootCA.pem" class="external-link" target="_blank" rel="noopener">rootCA.pem</a></li> <li><a href="https://ericxliu.me/vpnclient.ovpn" class="external-link" target="_blank" rel="noopener">vpnclient.ovpn</a></li> </ul><link>/posts/a-deep-dive-into-ppo-for-language-models/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/posts/a-deep-dive-into-ppo-for-language-models/</guid><description><p>Large Language Models (LLMs) have demonstrated astonishing capabilities, but out-of-the-box, they are simply powerful text predictors. They don&rsquo;t inherently understand what makes a response helpful, harmless, or aligned with human values. The technique that has proven most effective at bridging this gap is Reinforcement Learning from Human Feedback (RLHF), and at its heart lies a powerful algorithm: Proximal Policy Optimization (PPO).</p> diff --git a/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html b/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html new file mode 100644 index 0000000..9e66401 --- /dev/null +++ b/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html @@ -0,0 +1,33 @@ +<!doctype html><html lang=en><head><title>T5 - The Transformer That Zigged When Others Zagged - An Architectural Deep Dive · Eric X. Liu's Personal Page

T5 - The Transformer That Zigged When Others Zagged - An Architectural Deep Dive

In the rapidly evolving landscape of Large Language Models, a few key architectures define the dominant paradigms. Today, the “decoder-only” model, popularized by the GPT series and carried forward by models like LLaMA and Mistral, reigns supreme. These models are scaled to incredible sizes and excel at in-context learning.

But to truly understand the field, we must look at the pivotal models that explored different paths. Google’s T5, or Text-to-Text Transfer Transformer, stands out as one of the most influential. It didn’t just introduce a new model; it proposed a new philosophy. This article dives deep into the architecture of T5, how it fundamentally differs from modern LLMs, and the lasting legacy of its unique design choices.

The Core Philosophy: Everything is a Text-to-Text Problem

The genius of T5 lies in its unifying framework. Instead of building different models or fine-tuning procedures for various NLP tasks, T5 reframes every task as a text-to-text problem. The model takes a string as input and generates a string as output, regardless of the underlying objective.

This is accomplished by adding a task prefix to the input. These prefixes are not conversational prompts like a GPT “system prompt”; they are learned triggers that the model is explicitly fine-tuned to recognize.

Task           | T5 Input                                                         | Expected T5 Output
Translation    | translate English to German: The cat is cute.                    | Die Katze ist süß.
Summarization  | summarize: [A long news article...]                              | [A concise summary.]
Classification | cola sentence: The boys is walking.                              | unacceptable
Similarity     | stsb sentence1: The car is red. sentence2: The auto is crimson.  | 4.8

This elegant approach turns even classification into a generation task, where the model learns to generate the text of the correct label.
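
As a concrete illustration, here is a minimal sketch of this interface using the Hugging Face transformers library; the t5-small checkpoint, generation length, and printed output are illustrative choices, not details from the original T5 release.

  # Minimal sketch: every task is a string in, string out. Assumes the
  # publicly available "t5-small" checkpoint and the transformers +
  # sentencepiece packages are installed.
  from transformers import T5Tokenizer, T5ForConditionalGeneration

  tokenizer = T5Tokenizer.from_pretrained("t5-small")
  model = T5ForConditionalGeneration.from_pretrained("t5-small")

  # The task prefix selects the behavior; the model simply generates text.
  inputs = tokenizer("translate English to German: The cat is cute.", return_tensors="pt")
  output_ids = model.generate(**inputs, max_new_tokens=20)
  print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # e.g. "Die Katze ist süß."

Swapping the prefix to summarize: or cola sentence: changes the task without touching the model code.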

The Engine: A Two-Window Encoder-Decoder Architecture

To execute this text-to-text mission, T5 uses the original Transformer’s encoder-decoder architecture. This is the most significant point of divergence from modern decoder-only LLMs. The inference process works in two distinct stages:

Stage 1: The Encoder (The “Understanding” Window)

When T5 receives an input like summarize: [article text], the entire string is fed into the encoder.

  • Bidirectional Context: The encoder processes the input bidirectionally. Every token can see every other token in the input text simultaneously. This allows the model to build a deep, holistic understanding of the entire prompt and its context.
  • Static Representation: The encoder’s final output is not text. It’s a set of numerical representations (hidden states) that encapsulates the meaning and intent of the input. This representation is generated once and remains static for the entire generation process.

Stage 2: The Decoder (The “Writing” Window)

The decoder is responsible for generating the output string token by token.

  • Autoregressive Generation: It begins with a start-of-sequence token and generates the output one token at a time.
  • Cross-Attention: At each step, the decoder does two things: it looks at the text it has generated so far (its own “decoder context”), and crucially, it uses a mechanism called cross-attention to look back at the static representation created by the encoder. This allows the decoder’s generation to be guided by the encoder’s complete understanding of the prompt.
  • Growing Context: The decoder’s context window grows with each token it generates until it produces an end-of-sequence token, signaling that the task is complete.

This two-window system is a powerful design, especially for tasks that require a full understanding of a source document before generating a new one (like translation or summarization).
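
To make the two stages concrete, the sketch below runs the encoder exactly once and then decodes greedily, re-attending to the frozen encoder states at every step. It is a simplified illustration built on the Hugging Face transformers API (no KV caching, no sampling), not T5's reference implementation, and the article text is a stand-in.

  import torch
  from transformers import T5Tokenizer, T5ForConditionalGeneration

  tokenizer = T5Tokenizer.from_pretrained("t5-small")
  model = T5ForConditionalGeneration.from_pretrained("t5-small")

  long_article = "Text of a long news article goes here."  # placeholder source document
  enc = tokenizer("summarize: " + long_article, return_tensors="pt")

  # Stage 1: the encoder reads the whole prompt bidirectionally, exactly once.
  encoder_outputs = model.get_encoder()(**enc)

  # Stage 2: greedy autoregressive decoding; every step re-attends to the
  # static encoder states through cross-attention.
  decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
  for _ in range(64):
      logits = model(encoder_outputs=encoder_outputs,
                     decoder_input_ids=decoder_input_ids).logits
      next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
      decoder_input_ids = torch.cat([decoder_input_ids, next_id], dim=-1)
      if next_id.item() == model.config.eos_token_id:  # end-of-sequence: task complete
          break

  print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))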

Architectural Divergence: T5 vs. The Modern LLM Playbook

Beyond its core architecture, T5 made several specific design choices that contrast with today’s standards.

1. Positional Embeddings: Relative (RPE) vs. Rotary (RoPE)

How a model knows the order of words is critical.

  • T5’s Approach (RPE): T5 uses a form of Relative Positional Embedding. Instead of adding a position signal to the word embeddings, it adds a learned bias directly to the attention scores based on the relative distance between tokens. It’s a clever way to encode position that is independent of sequence length; a minimal sketch of this idea follows after this list.
  • The Modern Standard (RoPE): Most modern LLMs (LLaMA, PaLM, Mistral) use Rotary Positional Embeddings. As detailed in the CS336 slides, RoPE works by mathematically rotating the Query and Key vectors based on their absolute position. This method has proven exceptionally effective for long sequences and is considered the current state-of-the-art.
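
Below is a minimal, illustrative sketch of the relative-bias idea: a learned bias, indexed by the (clipped) distance between query and key positions, is added straight onto the attention scores. Real T5 additionally buckets distances logarithmically and shares the bias in a specific way across layers, which is omitted here.

  import torch

  seq_len, num_heads, max_distance = 8, 4, 16

  # One learned scalar bias per (clipped relative distance, head).
  rel_bias = torch.nn.Embedding(2 * max_distance + 1, num_heads)

  positions = torch.arange(seq_len)
  rel_dist = positions[None, :] - positions[:, None]               # (seq, seq)
  rel_dist = rel_dist.clamp(-max_distance, max_distance) + max_distance

  bias = rel_bias(rel_dist).permute(2, 0, 1)                       # (heads, seq, seq)

  # Scores from Q·K^T / sqrt(d_head) would go here; random stand-ins for brevity.
  scores = torch.randn(num_heads, seq_len, seq_len) + bias
  attn_weights = scores.softmax(dim=-1)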

2. The Feed-Forward Network: An Extreme Experiment

The inner dimension of the feed-forward network (FFN) inside each Transformer block is typically 4 times the model’s hidden dimension (d_model). The original T5 11B model took a radical departure from this rule.

  • T5 11B’s Choice: It used a small hidden dimension (d_model = 1024) but an astoundingly large FFN dimension (d_ff = 65,536), a 64-times multiplier. The rationale was that modern accelerators (like Google’s TPUs) are highly efficient at large, dense matrix multiplications. A quick parameter-count comparison follows after this list.
  • The Modern Standard: This experiment was not widely adopted. Later models, including T5’s own successor T5 v1.1, reverted to the standard 4x multiplier (or ~2.66x when using GLU activations) for a better balance of parameters and performance.
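
A back-of-the-envelope comparison shows what that multiplier means in parameters. The sketch counts only the up- and down-projection matrices of a single FFN layer (biases and gating ignored); the dimensions are the ones quoted above.

  # Per-layer FFN weights: up-projection (d_model x d_ff) + down-projection (d_ff x d_model).
  def ffn_params(d_model: int, d_ff: int) -> int:
      return 2 * d_model * d_ff

  standard_4x = ffn_params(1024, 4 * 1024)   # 8,388,608 parameters per layer
  t5_11b      = ffn_params(1024, 65_536)     # 134,217,728 parameters per layer
  print(t5_11b / standard_4x)                # 16.0 -- a 16x heavier FFN at the same d_model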

3. Denoising: Span Corruption vs. Iterative Diffusion

While T5’s pre-training is called “denoising,” it’s conceptually different from the denoising in modern diffusion models.

  • T5’s Denoising: This is span corruption. The model is shown a sentence with chunks of text masked out and learns to predict exactly what was removed in a single step. It’s a fill-in-the-blanks task to learn rich language representations; a toy example of the input/target construction is sketched after this list.
  • Diffusion Denoising: This is a multi-step generative process. A clean text is gradually corrupted with noise, and the model learns to reverse this process step-by-step, allowing it to generate high-fidelity text from pure noise.
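
Here is a toy sketch of how a span-corrupted training pair can be built. The sentinel names follow T5's <extra_id_N> convention; the fixed span indices are purely illustrative, whereas real T5 samples spans at random (the paper's default corrupts roughly 15% of tokens with a mean span length of 3).

  # Toy span corruption: drop the given token spans, replace each with a
  # sentinel in the input, and reconstruct only the dropped spans in the target.
  def corrupt(tokens, spans):  # spans: list of (start, end) indices to drop
      inp, tgt, cursor = [], [], 0
      for i, (s, e) in enumerate(spans):
          inp += tokens[cursor:s] + [f"<extra_id_{i}>"]
          tgt += [f"<extra_id_{i}>"] + tokens[s:e]
          cursor = e
      inp += tokens[cursor:]
      tgt += [f"<extra_id_{len(spans)}>"]          # final sentinel marks the end
      return " ".join(inp), " ".join(tgt)

  tokens = "Thank you for inviting me to your party last week .".split()
  print(corrupt(tokens, [(2, 4), (8, 9)]))
  # ('Thank you <extra_id_0> me to your party <extra_id_1> week .',
  #  '<extra_id_0> for inviting <extra_id_1> last <extra_id_2>')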

Where T5 Was Ahead of its Time

Despite its differences, the “T5 v1.1” variant pioneered several techniques that are now standard practice in the most advanced LLMs:

  • RMSNorm: It was one of the first major models to adopt Root Mean Square Normalization instead of LayerNorm, a choice now used by LLaMA, Mistral, and others for its efficiency and stability.
  • Pre-Normalization: T5 applies the normalization layer before the attention and FFN blocks, a critical technique for enabling stable training of very deep networks.
  • No Bias Terms: T5 v1.1 removed the bias parameters from its normalization and FFN layers, a small but important optimization for memory and stability that modern models follow.
  • Gated Activations (GeGLU): While the original T5 used ReLU, T5 v1.1 adopted a Gated Linear Unit (GeGLU), presaging the move to GLU-family activations (like SwiGLU) that is now ubiquitous. A minimal sketch of RMSNorm and GeGLU follows after this list.
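
A compact sketch of two of these components appears below: RMSNorm (no mean subtraction, no bias) and a bias-free GeGLU feed-forward block, applied in pre-normalization order. Dimension sizes are arbitrary, and this illustrates the ideas rather than reproducing the exact T5 v1.1 code.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class RMSNorm(nn.Module):
      # Rescale by the root-mean-square of the activations; learned gain, no bias.
      def __init__(self, d_model: int, eps: float = 1e-6):
          super().__init__()
          self.weight = nn.Parameter(torch.ones(d_model))
          self.eps = eps

      def forward(self, x):
          return x * x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt() * self.weight

  class GeGLUFFN(nn.Module):
      # GELU-gated feed-forward block; all projections without bias terms.
      def __init__(self, d_model: int, d_ff: int):
          super().__init__()
          self.wi_gate = nn.Linear(d_model, d_ff, bias=False)
          self.wi_up = nn.Linear(d_model, d_ff, bias=False)
          self.wo = nn.Linear(d_ff, d_model, bias=False)

      def forward(self, x):
          return self.wo(F.gelu(self.wi_gate(x)) * self.wi_up(x))

  # Pre-normalization: normalize before the sublayer, then add the residual.
  x = torch.randn(2, 16, 512)
  block_norm, block_ffn = RMSNorm(512), GeGLUFFN(512, 1024)
  x = x + block_ffn(block_norm(x))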

Conclusion: The Lasting Legacy

T5 represents a different evolutionary branch in the Transformer family tree. While the field has largely converged on the decoder-only architecture for its scalability in general-purpose models, T5’s design remains a masterclass in purpose-built engineering.

Its text-to-text framework was revolutionary, its encoder-decoder structure is still a go-to for tasks like translation, and its refined T5 v1.1 architecture laid the groundwork for many of the stability and efficiency tricks we see in today’s state-of-the-art models. T5 is more than just a model; it’s a crucial case study in the architectural trade-offs that continue to shape the future of artificial intelligence.

\ No newline at end of file diff --git a/posts/useful/index.html b/posts/useful/index.html index ff71fe0..e91514e 100644 --- a/posts/useful/index.html +++ b/posts/useful/index.html @@ -10,4 +10,4 @@ One-minute read
  • [a3ccac4] \ No newline at end of file +[b6192ca] \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index da6e60d..1399b9b 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -1 +1 @@ -/2025-08-02T15:46:24-07:00weekly0.5/posts/2025-08-02T15:46:24-07:00weekly0.5/posts/useful/2020-10-26T04:47:36+00:00weekly0.5/about/2020-06-16T23:30:17-07:00weekly0.5/posts/a-deep-dive-into-ppo-for-language-models/2025-08-02T15:46:24-07:00weekly0.5/categories/weekly0.5/tags/weekly0.5 \ No newline at end of file +/2025-08-03T01:47:39+00:00weekly0.5/posts/2025-08-03T01:47:39+00:00weekly0.5/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/2025-08-03T01:47:39+00:00weekly0.5/posts/useful/2020-10-26T04:47:36+00:00weekly0.5/about/2020-06-16T23:30:17-07:00weekly0.5/posts/a-deep-dive-into-ppo-for-language-models/2025-08-02T15:46:24-07:00weekly0.5/categories/weekly0.5/tags/weekly0.5 \ No newline at end of file diff --git a/tags/index.html b/tags/index.html index 356ce1f..e07b921 100644 --- a/tags/index.html +++ b/tags/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[a3ccac4] \ No newline at end of file +[b6192ca] \ No newline at end of file