From c8813b97f39f5706e4c3743a820dbf5a14199b3b Mon Sep 17 00:00:00 2001 From: eric Date: Sat, 9 Aug 2025 04:06:01 +0000 Subject: [PATCH] deploy: c9ed800d9fce2a440a321cc022c16bbfbc939c18 --- 404.html | 2 +- about/index.html | 2 +- categories/index.html | 2 +- index.html | 2 +- index.xml | 3 ++- .../index.html | 2 +- .../index.html | 2 +- .../index.html | 21 +++++++++++++++++++ posts/index.html | 5 +++-- posts/index.xml | 3 ++- .../index.html | 2 +- posts/supabase-deep-dive/index.html | 2 +- .../index.html | 2 +- posts/useful/index.html | 2 +- sitemap.xml | 2 +- tags/index.html | 2 +- 16 files changed, 40 insertions(+), 16 deletions(-) create mode 100644 posts/how-rvq-teaches-llms-to-see-and-hear/index.html diff --git a/404.html b/404.html index faef608..142bdf9 100644 --- a/404.html +++ b/404.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[c25cd89] \ No newline at end of file +[c9ed800] \ No newline at end of file diff --git a/about/index.html b/about/index.html index ec32479..740abc2 100644 --- a/about/index.html +++ b/about/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[c25cd89] \ No newline at end of file +[c9ed800] \ No newline at end of file diff --git a/categories/index.html b/categories/index.html index ac1cac6..0a5af57 100644 --- a/categories/index.html +++ b/categories/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[c25cd89] \ No newline at end of file +[c9ed800] \ No newline at end of file diff --git a/index.html b/index.html index ae1a4e5..d06cd49 100644 --- a/index.html +++ b/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[c25cd89] \ No newline at end of file +[c9ed800] \ No newline at end of file diff --git a/index.xml b/index.xml index f48f163..5fd3f5e 100644 --- a/index.xml +++ b/index.xml @@ -1,4 +1,5 @@ -Eric X. Liu's Personal Page/Recent content on Eric X. Liu's Personal PageHugoenMon, 04 Aug 2025 03:59:37 +0000Supabase Deep Dive: It's Not Magic, It's Just Postgres/posts/supabase-deep-dive/Sun, 03 Aug 2025 00:00:00 +0000/posts/supabase-deep-dive/<p>In the world of Backend-as-a-Service (BaaS), platforms are often treated as magic boxes. You push data in, you get data out, and you hope the magic inside scales. While this simplicity is powerful, it can obscure the underlying mechanics, leaving developers wondering what&rsquo;s really going on.</p> +Eric X. Liu's Personal Page/Recent content on Eric X. Liu's Personal PageHugoenFri, 08 Aug 2025 17:36:52 +0000Beyond Words: How RVQ Teaches LLMs to See and Hear/posts/how-rvq-teaches-llms-to-see-and-hear/Thu, 07 Aug 2025 00:00:00 +0000/posts/how-rvq-teaches-llms-to-see-and-hear/<p>Large Language Models (LLMs) are masters of text, but the world is not made of text alone. It’s a symphony of sights, sounds, and experiences. The ultimate goal for AI is to understand this rich, multi-modal world as we do. But how do you teach a model that thinks in words to understand a picture of a sunset or the melody of a song?</p> +<p>The answer lies in creating a universal language—a bridge between the continuous, messy world of pixels and audio waves and the discrete, structured world of language tokens. One of the most elegant and powerful tools for building this bridge is <strong>Residual Vector Quantization (RVQ)</strong>.</p>Supabase Deep Dive: It's Not Magic, It's Just Postgres/posts/supabase-deep-dive/Sun, 03 Aug 2025 00:00:00 +0000/posts/supabase-deep-dive/<p>In the world of Backend-as-a-Service (BaaS), platforms are often treated as magic boxes. You push data in, you get data out, and you hope the magic inside scales. 
While this simplicity is powerful, it can obscure the underlying mechanics, leaving developers wondering what&rsquo;s really going on.</p> <p>Supabase enters this space with a radically different philosophy: <strong>transparency</strong>. It provides the convenience of a BaaS, but it’s built on the world&rsquo;s most trusted relational database: PostgreSQL. The &ldquo;magic&rdquo; isn&rsquo;t a proprietary black box; it&rsquo;s a carefully assembled suite of open-source tools that enhance Postgres, not hide it.</p>A Deep Dive into PPO for Language Models/posts/a-deep-dive-into-ppo-for-language-models/Sat, 02 Aug 2025 00:00:00 +0000/posts/a-deep-dive-into-ppo-for-language-models/<p>Large Language Models (LLMs) have demonstrated astonishing capabilities, but out-of-the-box, they are simply powerful text predictors. They don&rsquo;t inherently understand what makes a response helpful, harmless, or aligned with human values. The technique that has proven most effective at bridging this gap is Reinforcement Learning from Human Feedback (RLHF), and at its heart lies a powerful algorithm: Proximal Policy Optimization (PPO).</p> <p>You may have seen diagrams like the one below, which outlines the RLHF training process. It can look intimidating, with a web of interconnected models, losses, and data flows.</p>Mixture-of-Experts (MoE) Models Challenges & Solutions in Practice/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/Wed, 02 Jul 2025 00:00:00 +0000/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/<p>Mixture-of-Experts (MoEs) are neural network architectures that allow different parts of the model (called &ldquo;experts&rdquo;) to specialize in different types of inputs. A &ldquo;gating network&rdquo; or &ldquo;router&rdquo; learns to dispatch each input (or &ldquo;token&rdquo;) to a subset of these experts. While powerful for scaling models, MoEs introduce several practical challenges.</p> <h3 id="1-challenge-non-differentiability-of-routing-functions"> diff --git a/posts/a-deep-dive-into-ppo-for-language-models/index.html b/posts/a-deep-dive-into-ppo-for-language-models/index.html index 2ea933e..22c1c0d 100644 --- a/posts/a-deep-dive-into-ppo-for-language-models/index.html +++ b/posts/a-deep-dive-into-ppo-for-language-models/index.html @@ -23,4 +23,4 @@ where δ_t = r_t + γV(s_{t+1}) - V(s_t)

  • γ (gam 2016 - 2025 Eric X. Liu -[c25cd89] \ No newline at end of file +[c9ed800] \ No newline at end of file diff --git a/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/index.html b/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/index.html index fb1e0cf..b8af4f7 100644 --- a/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/index.html +++ b/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/index.html @@ -20,4 +20,4 @@ Our overarching philosophy is simple: isolate and change only one variable at a 2016 - 2025 Eric X. Liu -[c25cd89] \ No newline at end of file +[c9ed800] \ No newline at end of file diff --git a/posts/how-rvq-teaches-llms-to-see-and-hear/index.html b/posts/how-rvq-teaches-llms-to-see-and-hear/index.html new file mode 100644 index 0000000..24353e2 --- /dev/null +++ b/posts/how-rvq-teaches-llms-to-see-and-hear/index.html @@ -0,0 +1,21 @@ +Beyond Words: How RVQ Teaches LLMs to See and Hear · Eric X. Liu's Personal Page

    Beyond Words: How RVQ Teaches LLMs to See and Hear

    Large Language Models (LLMs) are masters of text, but the world is not made of text alone. It’s a symphony of sights, sounds, and experiences. The ultimate goal for AI is to understand this rich, multi-modal world as we do. But how do you teach a model that thinks in words to understand a picture of a sunset or the melody of a song?

    The answer lies in creating a universal language—a bridge between the continuous, messy world of pixels and audio waves and the discrete, structured world of language tokens. One of the most elegant and powerful tools for building this bridge is Residual Vector Quantization (RVQ).

    This article dives deep into RVQ, exploring how it turns raw data into meaningful semantic IDs and how these IDs, in turn, unlock multi-modal understanding in LLMs.

    What is Residual Vector Quantization? The Art of Smart Compression

    At its core, Vector Quantization (VQ) is a compression technique. It maps a high-dimensional vector (like a data embedding) to the single closest vector in a predefined dictionary, called a codebook. You then only need to store the index of that chosen vector. The problem? To represent complex data accurately, you’d need a codebook with an astronomical number of entries, which is computationally infeasible.

    This is where Residual Vector Quantization shines. Instead of one giant codebook, RVQ uses a series of smaller codebooks in stages.

    1. Stage 1 (Coarse Quantization): The input vector is quantized by the first codebook. This finds the broadest, most general category for the data.
    2. Calculate the Residual: The system calculates the error, or “residual,” between the original vector and its quantized version from Stage 1. This residual vector represents the information that was lost in the first coarse approximation.
    3. Stage 2 (Refinement): This residual vector is then quantized by the second codebook. This stage doesn’t re-evaluate the whole vector, but only focuses on correcting the error from the previous stage.
    4. Iterate: This process repeats for several stages, with each subsequent codebook quantizing the residual error from the previous one, adding a finer and finer layer of detail.

    The final compressed representation is simply the sequence of indices chosen at each stage, producing an ID like [8, 5, 4, 1]. The magic of this approach is that the ID is hierarchical. The first index (8) might represent “Sports,” the next (5) refines it to “Court Sports,” (4) to “Beach Volleyball,” and the final (1) distinguishes a specific match. Videos with similar content will naturally share a longer prefix in their Semantic ID.
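
    To make this concrete, here is a minimal NumPy sketch of the encode/decode loop. It assumes the codebooks already exist (random placeholders here; in practice they are learned, as the next section describes):

        import numpy as np

        def rvq_encode(x, codebooks):
            """Quantize x into one index per stage (a Semantic ID)."""
            indices = []
            residual = x.copy()
            for cb in codebooks:  # cb: (K, D) array of codebook entries
                # Pick the entry closest to the current residual.
                idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
                indices.append(idx)
                # Subtract it; later stages only refine this error.
                residual = residual - cb[idx]
            return indices

        def rvq_decode(indices, codebooks):
            # Reconstruction is the sum of the selected entries.
            return sum(cb[i] for cb, i in zip(codebooks, indices))

        # Toy demo: 4 stages of 16 entries each, for 8-dim embeddings.
        rng = np.random.default_rng(0)
        codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
        x = rng.normal(size=8)
        ids = rvq_encode(x, codebooks)     # a 4-number Semantic ID
        error = np.linalg.norm(x - rvq_decode(ids, codebooks))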

    Learning What Matters: The Trainable VQ-Autoencoder

    A key insight is that RVQ is not a fixed algorithm but a trainable neural network component. Its codebooks are not predefined; they are learned. This learning happens within a Vector-Quantized Autoencoder (VQ-AE) architecture.

    1. Encoder: A powerful neural network (e.g., a Transformer or CNN) takes the raw data (like video frames and audio) and converts it into a continuous semantic embedding.
    2. RVQ Bottleneck: This embedding is fed into the RVQ module, which quantizes it into the sequence of discrete IDs.
    3. Decoder: The decoder takes these discrete IDs, looks up the corresponding codebook vectors, sums them up to get a reconstructed embedding, and attempts to rebuild the original video/audio.

    The entire system is trained end-to-end. The reconstruction loss (the difference between the original and reconstructed data) is used to update the parameters of the Encoder, the Decoder, and, most importantly, the codebook vectors within the RVQ module. Initially random, the codebook vectors are gradually pushed to become meaningful “anchors” for the core concepts present in the training data.
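
    One detail the pipeline above leaves implicit: picking the nearest codebook entry (an argmin) is not differentiable, so VQ-VAE-style models typically train the quantizer with a straight-through estimator plus commitment/codebook losses. The PyTorch sketch below illustrates that common recipe; it is a simplified toy under those assumptions, not the exact bottleneck of any particular production model:

        import torch
        import torch.nn as nn

        class RVQBottleneck(nn.Module):
            def __init__(self, num_stages=4, codebook_size=16, dim=8):
                super().__init__()
                self.codebooks = nn.ModuleList(
                    [nn.Embedding(codebook_size, dim) for _ in range(num_stages)]
                )

            def forward(self, z):                              # z: (batch, dim)
                residual = z
                quantized = torch.zeros_like(z)
                ids, aux_loss = [], 0.0
                for cb in self.codebooks:
                    dists = torch.cdist(residual, cb.weight)   # (batch, K)
                    idx = dists.argmin(dim=1)
                    q = cb(idx)                                # (batch, dim)
                    ids.append(idx)
                    # Codebook + commitment terms pull the codebook entries
                    # and the encoder outputs toward each other.
                    aux_loss = aux_loss + ((q - residual.detach()) ** 2).mean() \
                                        + ((q.detach() - residual) ** 2).mean()
                    quantized = quantized + q
                    residual = residual - q.detach()
                # Straight-through: use quantized values in the forward pass,
                # but copy gradients back to the encoder output z.
                quantized = z + (quantized - z).detach()
                return quantized, torch.stack(ids, dim=1), aux_loss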

    From Implicit to Explicit: Controlling Semantics with Contrastive Learning

    A standard VQ-AE learns implicit semantics. It gets good at reconstruction, but we can’t control what concepts it learns. To make the Semantic IDs truly meaningful and aligned with human language, we introduce contrastive learning.

    The architecture is enhanced with a parallel text encoder (like BERT or CLIP’s). The model is then trained with a joint loss function:

    L_total = L_reconstruction + λ * L_contrastive

    • Reconstruction Loss ensures the RVQ codes contain enough information to rebuild the input.
    • Contrastive Loss forces the media embedding (from the video/audio encoder) to be mathematically “close” to the text embedding of its description, and “far” from the embeddings of unrelated text descriptions.

    This dual goal forces the model to organize its embedding space according to the semantics of human language. The codebook vectors now learn to represent concepts that are not just useful for reconstruction, but are also tied to explicit textual descriptions.
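
    As a sketch of how this joint objective might look in code, the snippet below pairs an MSE reconstruction term with a CLIP-style symmetric InfoNCE contrastive term. The particular loss choices and the temperature are illustrative assumptions; the article only fixes the general form of L_total:

        import torch
        import torch.nn.functional as F

        def joint_loss(x, x_hat, media_emb, text_emb, lam=0.5, temp=0.07):
            # Reconstruction: how well the decoder rebuilt the input.
            recon = F.mse_loss(x_hat, x)
            # Contrastive: the i-th media embedding should match the i-th
            # text embedding and no other in the batch (InfoNCE).
            media = F.normalize(media_emb, dim=-1)
            text = F.normalize(text_emb, dim=-1)
            logits = media @ text.t() / temp                  # (batch, batch)
            targets = torch.arange(logits.size(0), device=logits.device)
            contrastive = (F.cross_entropy(logits, targets)
                           + F.cross_entropy(logits.t(), targets)) / 2
            return recon + lam * contrastive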

    Integrating with LLMs: Two Powerful Paths to Multi-Modality

    Once we have a contrastively-trained VQ-AE, we can use its output to give LLMs the ability to see and hear. There are two primary strategies for this.

    Path 1: The Tokenizer Approach - Teaching the LLM a New Language

    This path treats the RVQ IDs as a new vocabulary. It’s a two-stage process ideal for high-fidelity content generation.

    1. Create a Neural Codec: The trained VQ-AE serves as a powerful “codec.” You can take any piece of media (e.g., a song) and use the codec to compress it into a sequence of discrete RVQ tokens (e.g., [8, 5, 4, 1, 8, 5, 9, 2, ...]).
    2. Train a Generative LLM: A new Transformer model is trained auto-regressively on a massive dataset of these media-derived tokens. Its sole purpose is to learn the patterns and predict the next token in a sequence.

    Use Case: This is the architecture behind models like Meta’s MusicGen. A user provides a text prompt, which conditions the Transformer to generate a new sequence of RVQ tokens. These tokens are then fed to the VQ-AE’s decoder to synthesize the final audio waveform.
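
    The generation loop itself is ordinary autoregressive sampling over RVQ tokens. In the sketch below, lm and codec are hypothetical stand-ins for the trained Transformer (returning next-token logits) and the VQ-AE codec:

        import torch

        @torch.no_grad()
        def generate_media(lm, codec, prompt_ids, num_tokens=256, temp=1.0):
            tokens = prompt_ids                        # (1, T) conditioning
            for _ in range(num_tokens):
                logits = lm(tokens)[:, -1, :] / temp   # logits for next token
                probs = logits.softmax(dim=-1)
                next_tok = torch.multinomial(probs, num_samples=1)
                tokens = torch.cat([tokens, next_tok], dim=1)
            return codec.decode(tokens)                # RVQ tokens -> waveform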

    Path 2: The Adapter Approach - Translating for a Language Expert

    This path is used to augment a powerful, pre-trained, text-only LLM without the astronomical cost of retraining it.

    1. Freeze the LLM: A massive, pre-trained LLM (like LLaMA) is frozen. Its deep language understanding is preserved.
    2. Use the Pre-Quantized Embedding: Instead of using the discrete RVQ tokens, we take the rich, continuous embedding vector produced by our media encoder just before it enters the RVQ module.
    3. Train a Small Adapter: A small, lightweight projection layer (or “adapter”) is trained. Its only job is to translate the media embedding into a vector that has the same format and structure as the LLM’s own word embeddings. It learns to map visual concepts to their corresponding “word” concepts in the LLM’s latent space.

    Use Case: This is the principle behind models like DeepMind’s Flamingo. To answer a question about an image, the image is passed through the media encoder and adapter. The resulting “vision-as-a-word” vector is inserted into the prompt sequence alongside the text tokens. The frozen LLM can now “reason” about the visual input because it has been translated into a format it already understands.
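
    In code, the adapter can be as small as a two-layer projection; only its parameters are trained while the LLM and media encoder stay frozen. The sketch below is an illustration under those assumptions, not Flamingo’s actual architecture:

        import torch
        import torch.nn as nn

        class MediaAdapter(nn.Module):
            """Maps a media embedding into the LLM's word-embedding space."""
            def __init__(self, media_dim, llm_dim):
                super().__init__()
                self.proj = nn.Sequential(
                    nn.Linear(media_dim, llm_dim),
                    nn.GELU(),
                    nn.Linear(llm_dim, llm_dim),
                )

            def forward(self, media_emb):               # (batch, media_dim)
                return self.proj(media_emb)              # (batch, llm_dim)

        def splice_into_prompt(adapter, media_emb, text_token_embs):
            # Insert the "vision-as-a-word" vector ahead of the text tokens;
            # the frozen LLM then attends over the combined sequence.
            media_as_word = adapter(media_emb).unsqueeze(1)   # (B, 1, llm_dim)
            return torch.cat([media_as_word, text_token_embs], dim=1)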

    \ No newline at end of file diff --git a/posts/index.html b/posts/index.html index 1ebd185..27554ca 100644 --- a/posts/index.html +++ b/posts/index.html @@ -1,6 +1,7 @@ Posts · Eric X. Liu's Personal Page
    \ No newline at end of file +[c9ed800] \ No newline at end of file diff --git a/posts/index.xml b/posts/index.xml index 9bd04c8..4f6b254 100644 --- a/posts/index.xml +++ b/posts/index.xml @@ -1,4 +1,5 @@ -Posts on Eric X. Liu's Personal Page/posts/Recent content in Posts on Eric X. Liu's Personal PageHugoenMon, 04 Aug 2025 03:59:37 +0000Supabase Deep Dive: It's Not Magic, It's Just Postgres/posts/supabase-deep-dive/Sun, 03 Aug 2025 00:00:00 +0000/posts/supabase-deep-dive/<p>In the world of Backend-as-a-Service (BaaS), platforms are often treated as magic boxes. You push data in, you get data out, and you hope the magic inside scales. While this simplicity is powerful, it can obscure the underlying mechanics, leaving developers wondering what&rsquo;s really going on.</p> +Posts on Eric X. Liu's Personal Page/posts/Recent content in Posts on Eric X. Liu's Personal PageHugoenFri, 08 Aug 2025 17:36:52 +0000Beyond Words: How RVQ Teaches LLMs to See and Hear/posts/how-rvq-teaches-llms-to-see-and-hear/Thu, 07 Aug 2025 00:00:00 +0000/posts/how-rvq-teaches-llms-to-see-and-hear/<p>Large Language Models (LLMs) are masters of text, but the world is not made of text alone. It’s a symphony of sights, sounds, and experiences. The ultimate goal for AI is to understand this rich, multi-modal world as we do. But how do you teach a model that thinks in words to understand a picture of a sunset or the melody of a song?</p> +<p>The answer lies in creating a universal language—a bridge between the continuous, messy world of pixels and audio waves and the discrete, structured world of language tokens. One of the most elegant and powerful tools for building this bridge is <strong>Residual Vector Quantization (RVQ)</strong>.</p>Supabase Deep Dive: It's Not Magic, It's Just Postgres/posts/supabase-deep-dive/Sun, 03 Aug 2025 00:00:00 +0000/posts/supabase-deep-dive/<p>In the world of Backend-as-a-Service (BaaS), platforms are often treated as magic boxes. You push data in, you get data out, and you hope the magic inside scales. While this simplicity is powerful, it can obscure the underlying mechanics, leaving developers wondering what&rsquo;s really going on.</p> <p>Supabase enters this space with a radically different philosophy: <strong>transparency</strong>. It provides the convenience of a BaaS, but it’s built on the world&rsquo;s most trusted relational database: PostgreSQL. The &ldquo;magic&rdquo; isn&rsquo;t a proprietary black box; it&rsquo;s a carefully assembled suite of open-source tools that enhance Postgres, not hide it.</p>A Deep Dive into PPO for Language Models/posts/a-deep-dive-into-ppo-for-language-models/Sat, 02 Aug 2025 00:00:00 +0000/posts/a-deep-dive-into-ppo-for-language-models/<p>Large Language Models (LLMs) have demonstrated astonishing capabilities, but out-of-the-box, they are simply powerful text predictors. They don&rsquo;t inherently understand what makes a response helpful, harmless, or aligned with human values. The technique that has proven most effective at bridging this gap is Reinforcement Learning from Human Feedback (RLHF), and at its heart lies a powerful algorithm: Proximal Policy Optimization (PPO).</p> <p>You may have seen diagrams like the one below, which outlines the RLHF training process. 
It can look intimidating, with a web of interconnected models, losses, and data flows.</p>Mixture-of-Experts (MoE) Models Challenges & Solutions in Practice/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/Wed, 02 Jul 2025 00:00:00 +0000/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/<p>Mixture-of-Experts (MoEs) are neural network architectures that allow different parts of the model (called &ldquo;experts&rdquo;) to specialize in different types of inputs. A &ldquo;gating network&rdquo; or &ldquo;router&rdquo; learns to dispatch each input (or &ldquo;token&rdquo;) to a subset of these experts. While powerful for scaling models, MoEs introduce several practical challenges.</p> <h3 id="1-challenge-non-differentiability-of-routing-functions"> diff --git a/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html b/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html index 4a52517..1f5aeb7 100644 --- a/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html +++ b/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html @@ -44,4 +44,4 @@ The Top-K routing mechanism, as illustrated in the provided ima 2016 - 2025 Eric X. Liu -[c25cd89] \ No newline at end of file +[c9ed800] \ No newline at end of file diff --git a/posts/supabase-deep-dive/index.html b/posts/supabase-deep-dive/index.html index 0fa251f..891faac 100644 --- a/posts/supabase-deep-dive/index.html +++ b/posts/supabase-deep-dive/index.html @@ -90,4 +90,4 @@ Supabase enters this space with a radically different philosophy: transparency. 2016 - 2025 Eric X. Liu -[c25cd89] \ No newline at end of file +[c9ed800] \ No newline at end of file diff --git a/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html b/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html index 57dc12a..414888f 100644 --- a/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html +++ b/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html @@ -30,4 +30,4 @@ But to truly understand the field, we must look at the pivotal models that explo 2016 - 2025 Eric X. Liu -[c25cd89] \ No newline at end of file +[c9ed800] \ No newline at end of file diff --git a/posts/useful/index.html b/posts/useful/index.html index e3f6752..aaf65f8 100644 --- a/posts/useful/index.html +++ b/posts/useful/index.html @@ -9,4 +9,4 @@ One-minute read
    • [c25cd89] \ No newline at end of file +[c9ed800] \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index 79d93ac..5c9c927 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -1 +1 @@ -/2025-08-04T03:59:37+00:00weekly0.5/posts/2025-08-04T03:59:37+00:00weekly0.5/posts/supabase-deep-dive/2025-08-04T03:59:37+00:00weekly0.5/posts/a-deep-dive-into-ppo-for-language-models/2025-08-03T03:28:39+00:00weekly0.5/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/2025-08-03T06:02:48+00:00weekly0.5/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/2025-08-03T03:41:10+00:00weekly0.5/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/2025-08-03T04:20:20+00:00weekly0.5/posts/useful/2025-08-03T08:37:28-07:00weekly0.5/about/2020-06-16T23:30:17-07:00weekly0.5/categories/weekly0.5/tags/weekly0.5 \ No newline at end of file +/posts/how-rvq-teaches-llms-to-see-and-hear/2025-08-08T17:36:52+00:00weekly0.5/2025-08-08T17:36:52+00:00weekly0.5/posts/2025-08-08T17:36:52+00:00weekly0.5/posts/supabase-deep-dive/2025-08-04T03:59:37+00:00weekly0.5/posts/a-deep-dive-into-ppo-for-language-models/2025-08-03T03:28:39+00:00weekly0.5/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/2025-08-03T06:02:48+00:00weekly0.5/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/2025-08-03T03:41:10+00:00weekly0.5/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/2025-08-03T04:20:20+00:00weekly0.5/posts/useful/2025-08-03T08:37:28-07:00weekly0.5/about/2020-06-16T23:30:17-07:00weekly0.5/categories/weekly0.5/tags/weekly0.5 \ No newline at end of file diff --git a/tags/index.html b/tags/index.html index 99c2c27..00fb3a6 100644 --- a/tags/index.html +++ b/tags/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[c25cd89] \ No newline at end of file +[c9ed800] \ No newline at end of file