From 40a88799eed9a2771b5ccc369babd48cc89f4008 Mon Sep 17 00:00:00 2001 From: eric Date: Sat, 20 Dec 2025 01:50:43 +0000 Subject: [PATCH] deploy: 34aa99a15d7bfea38a0444e0ff9833656989f256 --- 404.html | 2 +- about/index.html | 2 +- categories/index.html | 2 +- index.html | 4 +- index.xml | 11 +++++- .../index.html | 2 +- .../index.html | 2 +- .../index.html | 2 +- .../index.html | 10 ++--- .../index.html | 2 +- posts/index.html | 10 ++--- posts/index.xml | 11 +++++- .../index.html | 2 +- .../index.html | 2 +- posts/page/2/index.html | 6 ++- posts/ppo-for-language-models/index.html | 2 +- posts/quantization-in-llms/index.html | 2 +- .../index.html | 12 +++--- posts/supabase-deep-dive/index.html | 38 +++++++++---------- .../index.html | 2 +- .../index.html | 29 ++++++++++++++ posts/transformer-s-core-mechanics/index.html | 2 +- .../index.html | 2 +- posts/useful/index.html | 2 +- posts/vattention/index.html | 34 +++++++++++++++++ sitemap.xml | 2 +- tags/index.html | 2 +- 27 files changed, 141 insertions(+), 58 deletions(-) create mode 100644 posts/the-convergence-of-fast-weights-linear-attention-and-state-space-models/index.html create mode 100644 posts/vattention/index.html diff --git a/404.html b/404.html index 00187ff..9806b6d 100644 --- a/404.html +++ b/404.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/about/index.html b/about/index.html index a339fc3..5ffcf33 100644 --- a/about/index.html +++ b/about/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/categories/index.html b/categories/index.html index 3178900..6adeb4f 100644 --- a/categories/index.html +++ b/categories/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/index.html b/index.html index ea63338..d514133 100644 --- a/index.html +++ b/index.html @@ -1,7 +1,7 @@ -Eric X. Liu's Personal Page
\ No newline at end of file diff --git a/index.xml b/index.xml index f4f4634..bd16f42 100644 --- a/index.xml +++ b/index.xml @@ -1,4 +1,13 @@ -Eric X. Liu's Personal Page/Recent content on Eric X. Liu's Personal PageHugoenSat, 04 Oct 2025 20:41:50 +0000Why Your Jetson Orin Nano's 40 TOPS Goes Unused (And What That Means for Edge AI)/posts/benchmarking-llms-on-jetson-orin-nano/Sat, 04 Oct 2025 00:00:00 +0000/posts/benchmarking-llms-on-jetson-orin-nano/<h2 id="introduction"> +Eric X. Liu's Personal Page/Recent content on Eric X. Liu's Personal PageHugoenFri, 19 Dec 2025 21:21:55 +0000The Convergence of Fast Weights, Linear Attention, and State Space Models/posts/the-convergence-of-fast-weights-linear-attention-and-state-space-models/Fri, 19 Dec 2025 00:00:00 +0000/posts/the-convergence-of-fast-weights-linear-attention-and-state-space-models/<p>Modern Large Language Models (LLMs) are dominated by the Transformer architecture. However, as context windows grow, the computational cost of the Transformer’s attention mechanism has become a primary bottleneck. Recent discussions in the AI community—most notably by Geoffrey Hinton—have highlighted a theoretical link between biological memory mechanisms (&ldquo;Fast Weights&rdquo;) and efficient engineering solutions like Linear Transformers and State Space Models (SSMs).</p> +<p>This article explores the mathematical equivalence between Hinton’s concept of Fast Weights as Associative Memory and the recurrence mechanisms found in models such as Mamba and RWKV.</p>vAttention/posts/vattention/Mon, 08 Dec 2025 00:00:00 +0000/posts/vattention/<p>Large Language Model (LLM) inference is memory-bound, primarily due to the Key-Value (KV) cache—a store of intermediate state that grows linearly with sequence length. Efficient management of this memory is critical for throughput. While <strong>PagedAttention</strong> (popularized by vLLM) became the industry standard by solving memory fragmentation via software, recent research suggests that leveraging the GPU’s native hardware Memory Management Unit (MMU) offers a more performant and portable solution.</p> +<h4 id="the-status-quo-pagedattention-and-software-tables"> + The Status Quo: PagedAttention and Software Tables + <a class="heading-link" href="#the-status-quo-pagedattention-and-software-tables"> + <i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"></i> + <span class="sr-only">Link to heading</span> + </a> +</h4> +<p>Prior to PagedAttention, systems allocated contiguous memory for the maximum possible context length, leading to severe fragmentation and wasted memory. PagedAttention addressed this by chunking the KV cache into non-contiguous blocks, managed by a software-defined &ldquo;page table&rdquo; (the Block Table) [1].</p>Why Your Jetson Orin Nano's 40 TOPS Goes Unused (And What That Means for Edge AI)/posts/benchmarking-llms-on-jetson-orin-nano/Sat, 04 Oct 2025 00:00:00 +0000/posts/benchmarking-llms-on-jetson-orin-nano/<h2 id="introduction"> Introduction <a class="heading-link" href="#introduction"> <i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"></i> diff --git a/posts/benchmarking-llms-on-jetson-orin-nano/index.html b/posts/benchmarking-llms-on-jetson-orin-nano/index.html index 1e74809..ef564ef 100644 --- a/posts/benchmarking-llms-on-jetson-orin-nano/index.html +++ b/posts/benchmarking-llms-on-jetson-orin-nano/index.html @@ -62,4 +62,4 @@ After running 66 inference tests across seven different language models ranging 2016 - 2025 Eric X. 
Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/breville-barista-pro-maintenance/index.html b/posts/breville-barista-pro-maintenance/index.html index a5ece95..fd5d4b0 100644 --- a/posts/breville-barista-pro-maintenance/index.html +++ b/posts/breville-barista-pro-maintenance/index.html @@ -25,4 +25,4 @@ Understanding the Two Primary Maintenance Cycles Link to heading The Breville Ba 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/index.html b/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/index.html index 3512f28..2dc2b9a 100644 --- a/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/index.html +++ b/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/index.html @@ -20,4 +20,4 @@ Our overarching philosophy is simple: isolate and change only one variable at a 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/flashing-jetson-orin-nano-in-virtualized-environments/index.html b/posts/flashing-jetson-orin-nano-in-virtualized-environments/index.html index e0b721b..6b5a040 100644 --- a/posts/flashing-jetson-orin-nano-in-virtualized-environments/index.html +++ b/posts/flashing-jetson-orin-nano-in-virtualized-environments/index.html @@ -76,7 +76,7 @@ Flashing NVIDIA Jetson devices remotely presents unique challenges when the host
  1. Created udev rules to automatically move USB network interfaces to the container:
# /etc/udev/rules.d/99-jetson-usb-network.rules
 ACTION=="add", SUBSYSTEM=="net", KERNEL=="enx*", RUN+="/usr/local/bin/handle-jetson-usb-network.sh %k"
 
  1. Created handler script to move interfaces into container namespace:
#!/bin/bash
-INTERFACE=$1
+INTERFACE=$1
 CONTAINER_ID=106
 CONTAINER_PID=$(pct exec $CONTAINER_ID -- pidof systemd | awk '{print $1}')
 ip link set "$INTERFACE" netns "ct$CONTAINER_ID"
@@ -108,7 +108,7 @@ Flashing NVIDIA Jetson devices remotely presents unique challenges when the host
 
 Link to heading
# Create VM
 qm create 200 --name jetson-flash --memory 4096 --cores 4 \
-    --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci
+    --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci
 
 # Set machine type to q35 (required for PCIe passthrough)
 qm set 200 --machine q35
@@ -118,11 +118,11 @@ Flashing NVIDIA Jetson devices remotely presents unique challenges when the host
 
 # Configure disk and cloud-init
 qm set 200 --scsi0 local-lvm:vm-200-disk-0 --boot order=scsi0 \
-    --ide2 local-lvm:cloudinit
+    --ide2 local-lvm:cloudinit
 
 # Configure cloud-init
 qm set 200 --ciuser sdkmanager --cipassword sdkmanager \
-    --ipconfig0 ip=dhcp --sshkeys ~/.ssh/authorized_keys
+    --ipconfig0 ip=dhcp --sshkeys ~/.ssh/authorized_keys
 
 # Add PCI passthrough for USB controller
 qm set 200 --hostpci0 0000:22:00.3,pcie=1
@@ -168,4 +168,4 @@ Flashing NVIDIA Jetson devices remotely presents unique challenges when the host
 2016 -
 2025
 Eric X. Liu
-[6ed1d69]
\ No newline at end of file
+[34aa99a]
\ No newline at end of file
diff --git a/posts/how-rvq-teaches-llms-to-see-and-hear/index.html b/posts/how-rvq-teaches-llms-to-see-and-hear/index.html
index b6900a7..a826c25 100644
--- a/posts/how-rvq-teaches-llms-to-see-and-hear/index.html
+++ b/posts/how-rvq-teaches-llms-to-see-and-hear/index.html
@@ -18,4 +18,4 @@ The answer lies in creating a universal language—a bridge between the continuo
 2016 -
 2025
 Eric X. Liu
-[6ed1d69]
\ No newline at end of file
+[34aa99a]
\ No newline at end of file
diff --git a/posts/index.html b/posts/index.html
index 07fdf01..a85ac6d 100644
--- a/posts/index.html
+++ b/posts/index.html
@@ -1,6 +1,8 @@
 Posts · Eric X. Liu's Personal Page
\ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/index.xml b/posts/index.xml index 897f2b8..1801449 100644 --- a/posts/index.xml +++ b/posts/index.xml @@ -1,4 +1,13 @@ -Posts on Eric X. Liu's Personal Page/posts/Recent content in Posts on Eric X. Liu's Personal PageHugoenSat, 04 Oct 2025 20:41:50 +0000Why Your Jetson Orin Nano's 40 TOPS Goes Unused (And What That Means for Edge AI)/posts/benchmarking-llms-on-jetson-orin-nano/Sat, 04 Oct 2025 00:00:00 +0000/posts/benchmarking-llms-on-jetson-orin-nano/<h2 id="introduction"> +Posts on Eric X. Liu's Personal Page/posts/Recent content in Posts on Eric X. Liu's Personal PageHugoenFri, 19 Dec 2025 21:21:55 +0000The Convergence of Fast Weights, Linear Attention, and State Space Models/posts/the-convergence-of-fast-weights-linear-attention-and-state-space-models/Fri, 19 Dec 2025 00:00:00 +0000/posts/the-convergence-of-fast-weights-linear-attention-and-state-space-models/<p>Modern Large Language Models (LLMs) are dominated by the Transformer architecture. However, as context windows grow, the computational cost of the Transformer’s attention mechanism has become a primary bottleneck. Recent discussions in the AI community—most notably by Geoffrey Hinton—have highlighted a theoretical link between biological memory mechanisms (&ldquo;Fast Weights&rdquo;) and efficient engineering solutions like Linear Transformers and State Space Models (SSMs).</p> +<p>This article explores the mathematical equivalence between Hinton’s concept of Fast Weights as Associative Memory and the recurrence mechanisms found in models such as Mamba and RWKV.</p>vAttention/posts/vattention/Mon, 08 Dec 2025 00:00:00 +0000/posts/vattention/<p>Large Language Model (LLM) inference is memory-bound, primarily due to the Key-Value (KV) cache—a store of intermediate state that grows linearly with sequence length. Efficient management of this memory is critical for throughput. While <strong>PagedAttention</strong> (popularized by vLLM) became the industry standard by solving memory fragmentation via software, recent research suggests that leveraging the GPU’s native hardware Memory Management Unit (MMU) offers a more performant and portable solution.</p> +<h4 id="the-status-quo-pagedattention-and-software-tables"> + The Status Quo: PagedAttention and Software Tables + <a class="heading-link" href="#the-status-quo-pagedattention-and-software-tables"> + <i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"></i> + <span class="sr-only">Link to heading</span> + </a> +</h4> +<p>Prior to PagedAttention, systems allocated contiguous memory for the maximum possible context length, leading to severe fragmentation and wasted memory. 
PagedAttention addressed this by chunking the KV cache into non-contiguous blocks, managed by a software-defined &ldquo;page table&rdquo; (the Block Table) [1].</p>Why Your Jetson Orin Nano's 40 TOPS Goes Unused (And What That Means for Edge AI)/posts/benchmarking-llms-on-jetson-orin-nano/Sat, 04 Oct 2025 00:00:00 +0000/posts/benchmarking-llms-on-jetson-orin-nano/<h2 id="introduction"> Introduction <a class="heading-link" href="#introduction"> <i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"></i> diff --git a/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html b/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html index 7850a49..710e41d 100644 --- a/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html +++ b/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html @@ -44,4 +44,4 @@ The Top-K routing mechanism, as illustrated in the provided ima 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/openwrt-mwan3-wireguard-endpoint-exclusion/index.html b/posts/openwrt-mwan3-wireguard-endpoint-exclusion/index.html index f1058ba..4defc40 100644 --- a/posts/openwrt-mwan3-wireguard-endpoint-exclusion/index.html +++ b/posts/openwrt-mwan3-wireguard-endpoint-exclusion/index.html @@ -98,4 +98,4 @@ When using WireGuard together with MWAN3 on OpenWrt, the tunnel can fail to esta 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/page/2/index.html b/posts/page/2/index.html index 93f3107..3c7932b 100644 --- a/posts/page/2/index.html +++ b/posts/page/2/index.html @@ -1,6 +1,8 @@ Posts · Eric X. Liu's Personal Page
\ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/ppo-for-language-models/index.html b/posts/ppo-for-language-models/index.html index c2021b1..bf61dfb 100644 --- a/posts/ppo-for-language-models/index.html +++ b/posts/ppo-for-language-models/index.html @@ -25,4 +25,4 @@ where δ_t = r_t + γV(s_{t+1}) - V(s_t)

  • γ (gam 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/quantization-in-llms/index.html b/posts/quantization-in-llms/index.html index b8cd9cd..5f4d51d 100644 --- a/posts/quantization-in-llms/index.html +++ b/posts/quantization-in-llms/index.html @@ -7,4 +7,4 @@ 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/secure-boot-dkms-and-mok-on-proxmox-debian/index.html b/posts/secure-boot-dkms-and-mok-on-proxmox-debian/index.html index ac5430b..04675ca 100644 --- a/posts/secure-boot-dkms-and-mok-on-proxmox-debian/index.html +++ b/posts/secure-boot-dkms-and-mok-on-proxmox-debian/index.html @@ -30,11 +30,11 @@ nvidia-smi failed to communicate with the NVIDIA driver modprobe nvidia → “K

Step 2 — Make DKMS sign NVIDIA modules with a MOK Link to heading

Debian already generated a DKMS key at /var/lib/dkms/mok.key. Create an X.509 cert in DER format:

sudo openssl req -new -x509 \
-  -key /var/lib/dkms/mok.key \
-  -out /var/lib/dkms/mok.der \
-  -outform DER \
-  -subj "/CN=DKMS MOK/" \
-  -days 36500
+  -key /var/lib/dkms/mok.key \
+  -out /var/lib/dkms/mok.der \
+  -outform DER \
+  -subj "/CN=DKMS MOK/" \
+  -days 36500
 

Enable DKMS signing:

sudo sed -i 's|^mok_signing_key=.*|mok_signing_key=/var/lib/dkms/mok.key|' /etc/dkms/framework.conf
 sudo sed -i 's|^mok_certificate=.*|mok_certificate=/var/lib/dkms/mok.der|' /etc/dkms/framework.conf
 

Rebuild/install modules (signs them now):

sudo dkms build nvidia/$(modinfo -F version nvidia) -k $(uname -r) --force
@@ -59,4 +59,4 @@ nvidia-smi failed to communicate with the NVIDIA driver modprobe nvidia → “K
 2016 -
 2025
 Eric X. Liu
-[6ed1d69]
\ No newline at end of file
+[34aa99a]
\ No newline at end of file
diff --git a/posts/supabase-deep-dive/index.html b/posts/supabase-deep-dive/index.html
index 1d5b5e2..c63232e 100644
--- a/posts/supabase-deep-dive/index.html
+++ b/posts/supabase-deep-dive/index.html
@@ -18,47 +18,47 @@ Supabase enters this space with a radically different philosophy: transparency.
 
 Link to heading

This is not an optional step. RLS is the cornerstone of Supabase security.

  1. Deny by Default: For any table holding user data, immediately enable RLS. This blocks all access until you explicitly grant it.
ALTER TABLE tasks ENABLE ROW LEVEL SECURITY;
 
  1. Write “Allow” Policies: Create policies based on your user stories. Policies are SQL rules that the database enforces on every single query.
-- Users can see tasks in projects they are a member of.
-CREATE POLICY "Allow read access to tasks in user's projects"
-ON tasks FOR SELECT
-USING (
+CREATE POLICY "Allow read access to tasks in user's projects"
+ON tasks FOR SELECT
+USING (
   EXISTS (
     SELECT 1 FROM project_users
     WHERE project_users.project_id = tasks.project_id
     AND project_users.user_id = auth.uid()
   )
-);
+);
 
--- Users can only insert tasks for themselves.
-CREATE POLICY "Allow users to create their own tasks"
-ON tasks FOR INSERT
-WITH CHECK ( auth.uid() = tasks.assignee_id );
+-- Users can only insert tasks for themselves.
+CREATE POLICY "Allow users to create their own tasks"
+ON tasks FOR INSERT
+WITH CHECK ( auth.uid() = tasks.assignee_id );
 

The auth.uid() function is a special Supabase utility that securely returns the ID of the logged-in user making the request.

Phase 4: The APIs (Data Access) Link to heading

With your data structured and secured, you can now build the access points.

  • For Simple CRUD: Use Supabase’s auto-generated API. It’s convenient, respects all your RLS policies, and is perfect for simple reads and writes on a single table.
const { data, error } = await supabase.from('tasks').select('*');
 
  • For Complex Logic: Use PostgreSQL Functions (RPC). Encapsulate complex JOINs or multi-step transactions into a single, callable function. This reduces network chattiness and keeps your business logic secure on the server.
-- A function to get a task and its project name in one call
-CREATE OR REPLACE FUNCTION get_task_with_project(task_id_input int)
-RETURNS TABLE (task_title text, project_name text) AS $$
-BEGIN
+CREATE OR REPLACE FUNCTION get_task_with_project(task_id_input int)
+RETURNS TABLE (task_title text, project_name text) AS $$
+BEGIN
   RETURN QUERY
     SELECT tasks.title, projects.name
     FROM tasks
     JOIN projects ON tasks.project_id = projects.id
     WHERE tasks.id = task_id_input;
-END;
-$$ LANGUAGE plpgsql;
+END;
+$$ LANGUAGE plpgsql;
 
// Called simply from the frontend
-const { data, error } = await supabase.rpc('get_task_with_project', { task_id_input: 123 });
+const { data, error } = await supabase.rpc('get_task_with_project', { task_id_input: 123 });
 

A Tour of the Core Services Link to heading

Beyond the database, Supabase provides a suite of essential tools.

Authentication Link to heading

A complete user management system that integrates directly with your database. When a user signs up, a corresponding entry is created in the managed auth.users table, which you can then reference in your own tables.

// Sign up a new user and handle social logins with ease
-const { data, error } = await supabase.auth.signUp({ email, password });
+const { data, error } = await supabase.auth.signUp({ email, password });
 const { data, error } = await supabase.auth.signInWithOAuth({ provider: 'github' });
 

Storage Link to heading

A simple, S3-compatible object store for managing files like user avatars or documents. It’s integrated with Postgres and RLS, allowing you to write fine-grained access policies on files and folders (buckets).

// Upload a user avatar to a public 'avatars' bucket
-const { error } = await supabase.storage
+const { error } = await supabase.storage
   .from('avatars')
   .upload(`public/${userId}.png`, file);
 

Edge Functions vs. Database Functions @@ -74,14 +74,14 @@ Supabase enters this space with a radically different philosophy: transparency. Link to heading

  • Use For: Small, JSON-based messages like chat messages, live notifications, activity feeds, and presence indicators (“who’s online”). Use the broadcast feature for ephemeral data like cursor positions that you don’t need to save.
  • Do NOT Use For: Large, continuous data streams. It is not a replacement for WebRTC for video/audio calls. The system is designed for small, infrequent payloads.
const channel = supabase.channel('public:messages');
 
 // Subscribe to new rows in the 'messages' table
-channel
+channel
   .on(
     'postgres_changes',
     { event: 'INSERT', schema: 'public', table: 'messages' },
     (payload) => {
       console.log('New message received!', payload.new);
       // Update your UI here
-    }
+    }
   )
   .subscribe();
 

Final Words of Advice @@ -90,4 +90,4 @@ Supabase enters this space with a radically different philosophy: transparency. 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html b/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html index e0a553c..85dfac5 100644 --- a/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html +++ b/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html @@ -30,4 +30,4 @@ But to truly understand the field, we must look at the pivotal models that explo 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/the-convergence-of-fast-weights-linear-attention-and-state-space-models/index.html b/posts/the-convergence-of-fast-weights-linear-attention-and-state-space-models/index.html new file mode 100644 index 0000000..50f0b8a --- /dev/null +++ b/posts/the-convergence-of-fast-weights-linear-attention-and-state-space-models/index.html @@ -0,0 +1,29 @@ +The Convergence of Fast Weights, Linear Attention, and State Space Models · Eric X. Liu's Personal Page

The Convergence of Fast Weights, Linear Attention, and State Space Models

Modern Large Language Models (LLMs) are dominated by the Transformer architecture. However, as context windows grow, the computational cost of the Transformer’s attention mechanism has become a primary bottleneck. Recent discussions in the AI community—most notably by Geoffrey Hinton—have highlighted a theoretical link between biological memory mechanisms (“Fast Weights”) and efficient engineering solutions like Linear Transformers and State Space Models (SSMs).

This article explores the mathematical equivalence between Hinton’s concept of Fast Weights as Associative Memory and the recurrence mechanisms found in models such as Mamba and RWKV.

1. The Standard Transformer Bottleneck Link to heading

To understand the motivation for Fast Weights, one must first identify the inefficiency in standard Transformers. The core operation is Self-Attention, defined as:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V $$

During inference (generating tokens one by one), the model computes a Query ($Q$) for the current token, compares it against the Keys ($K$) of all previous tokens, and aggregates their Values ($V$).

  • Computational Cost: Quadratic $O(N^2)$ during training; Linear $O(N)$ per step during inference.
  • Memory Cost: The KV Cache. To calculate the softmax, the model must explicitly store the $K$ and $V$ vectors for the entire history in GPU memory. For long contexts (e.g., 1 million tokens), this memory footprint becomes prohibitive.

The Softmax function is the culprit. It introduces a non-linearity that binds $Q$ and $K$ together, preventing the mathematical separation of the current query from the historical context.
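
To make the memory cost concrete, here is a back-of-the-envelope estimate for an assumed configuration (32 layers, 8 KV heads, head dimension 128, fp16 cache); the exact figure depends on the architecture:

$$ \text{KV bytes per token} = 2 \times 32 \times 8 \times 128 \times 2 \text{ B} = 128 \text{ KiB} $$

$$ \text{Cache at } N = 10^6 \text{ tokens} \approx 128 \text{ KiB} \times 10^6 \approx 122 \text{ GiB} $$

This exceeds the memory of any single commodity accelerator, which is why the KV cache dominates serving cost at long context.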

2. Fast Weights as Associative Memory Link to heading

Geoffrey Hinton proposes that the brain does not maintain a “digital buffer” of past activations (like a KV cache). Instead, it relies on Fast Weights.

In this framework, neural connections possess two timescales:

  1. Slow Weights: The standard parameters learned over long periods (training).
  2. Fast Weights: Synaptic strengths that change rapidly during a forward pass to store temporary context.

Hinton formalizes this temporary storage as an Associative Memory. When a network encounters a new key-value pair ($k, v$), it does not store the vectors in a list. Instead, it updates a fast weight matrix $W_{fast}$ using the Hebbian learning rule (outer product):

$$ W_{fast} \leftarrow \lambda W_{fast} + (v \otimes k) $$

Here, $\lambda$ is a decay factor ($0 < \lambda < 1$) representing forgetfulness. This matrix $W_{fast}$ compresses the history into a fixed-size representation of size $d \times d$, regardless of the sequence length.
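
A minimal NumPy sketch of this write/read cycle (the dimension, decay, and number of stored pairs are illustrative, not values from Hinton’s papers):

import numpy as np

rng = np.random.default_rng(0)
d, lam = 64, 0.95                     # feature dimension and decay factor λ (illustrative)
W_fast = np.zeros((d, d))             # fixed-size d x d fast-weight matrix

keys = rng.standard_normal((5, d))
values = rng.standard_normal((5, d))

# Write: Hebbian outer-product updates, W_fast <- λ W_fast + v ⊗ k
for k, v in zip(keys, values):
    W_fast = lam * W_fast + np.outer(v, k)

# Read: a single matrix product retrieves the value bound to a key.
retrieved = W_fast @ keys[-1]
cos = retrieved @ values[-1] / (np.linalg.norm(retrieved) * np.linalg.norm(values[-1]))
print(f"cosine(retrieved, stored value) = {cos:.2f}")   # near 1 when keys are nearly orthogonal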

3. Mathematical Unification: Linear Attention Link to heading

The connection between Fast Weights and Transformers is established by removing the softmax function from the attention mechanism, a technique known as Linear Attention.

If we treat the interaction between $Q$ and $K$ as linear, the attention equation becomes:

$$ \text{LinearAttention} = (Q K^T) V $$

Using the associative property of matrix multiplication, we can reorder the operations:

$$ Q (K^T V) $$

This reordering fundamentally alters the mechanism:

  • Left Side $(Q K^T) V$: Compare Query to all Keys, then multiply by Values. Requires storing history.
  • Right Side $Q (K^T V)$: Compute the summation of Key-Value outer products first.

The term $(K^T V)$ represents the summation of all past associations. This term is the Fast Weight matrix $W_{fast}$ described by Hinton.

$$ \text{State}_t = \sum_{i=1}^t k_i v_i^T $$

Thus, Linear Attention is effectively a system where the “state” is a matrix of Fast Weights that is updated at every time step.
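
A short NumPy check of this reordering in the causal setting (shapes are illustrative; no softmax is applied):

import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4                           # sequence length, head dimension
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Parallel form: masked (Q K^T) V, as a Transformer would compute it.
mask = np.tril(np.ones((N, N)))
Y_parallel = (mask * (Q @ K.T)) @ V

# Recurrent form: a single d x d fast-weight state updated per step.
S = np.zeros((d, d))
Y_recurrent = np.zeros((N, d))
for t in range(N):
    S += np.outer(K[t], V[t])         # State_t = sum_i k_i v_i^T
    Y_recurrent[t] = Q[t] @ S         # y_t = q_t State_t

assert np.allclose(Y_parallel, Y_recurrent)   # identical outputs, O(1) state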

4. State Space Models (SSMs) as Recurrent Fast Weights Link to heading

State Space Models (like S4 and Mamba) typically define sequence modeling through continuous control theory, discretized into a recurrence:

$$ h_t = \bar{A} h_{t-1} + \bar{B} x_t $$

$$ y_t = \bar{C} h_t $$

While derived differently, this recurrence is mathematically equivalent to the Linear Attention/Fast Weight mechanism. We can demonstrate this by “unrolling” the SSM recursion to see how the output $y_t$ depends on the history.

The output at time $t$ is the sum of inputs weighted by decaying powers of $\bar{A}$:

$$ y_t = \sum_{j=1}^t \bar{C} (\bar{A}^{t-j}) (\bar{B} x_j) $$

Comparing this to the Linear Attention formulation with decay $\lambda$:

$$ \text{Attention}_t = q_t \sum_{j=1}^t (\lambda^{t-j}) (k_j^T v_j) $$

The mapping between architectures becomes clear:

  • Query ($q_t$) $\leftrightarrow$ Output Matrix $\bar{C}$
  • Key/Value ($k_j^T v_j$) $\leftrightarrow$ Input Matrix $\bar{B} x_j$ (Input Projection)
  • Decay Factor ($\lambda$) $\leftrightarrow$ State Matrix $\bar{A}$
  • Fast Weight Matrix ($S_t$) $\leftrightarrow$ Hidden State $h_t$

Therefore, an SSM is mechanically a Transformer that uses Fast Weights (a fixed-size recurrent state) rather than a KV Cache (a growing buffer) to handle attention.
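
The equivalence can also be checked numerically, with a scalar decay λ standing in for a diagonal Ā (a toy verification, not an implementation of any particular SSM):

import numpy as np

rng = np.random.default_rng(1)
N, d, lam = 8, 4, 0.9                 # λ plays the role of Ā; shapes are illustrative
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Recurrent (SSM / fast-weight) form: S_t = λ S_{t-1} + k_t v_t^T, y_t = q_t S_t
S = np.zeros((d, d))
Y_rec = np.zeros((N, d))
for t in range(N):
    S = lam * S + np.outer(K[t], V[t])
    Y_rec[t] = Q[t] @ S

# Unrolled (attention) form: y_t = q_t · Σ_j λ^(t-j) k_j v_j^T
Y_unrolled = np.zeros((N, d))
for t in range(N):
    S_t = sum(lam ** (t - j) * np.outer(K[j], V[j]) for j in range(t + 1))
    Y_unrolled[t] = Q[t] @ S_t

assert np.allclose(Y_rec, Y_unrolled)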

5. Implications for Inference Optimization Link to heading

This theoretical convergence has significant implications for inference efficiency.

Standard Transformer Link to heading

  • Mechanism: Stores history in a KV Cache.
  • Memory: $O(N)$ (Grows linearly with sequence length).
  • Performance: High recall/precision because it retains the exact history.

Fast Weight / SSM (Mamba / RWKV) Link to heading

  • Mechanism: Compresses history into a single Matrix/Vector state.
  • Memory: $O(1)$ (Constant memory, regardless of sequence length).
  • Performance: Historically lower than Transformers due to “compression loss” (an unbounded history must be squeezed into a finite matrix).

The Solution: Modern SSMs like Mamba improve upon basic Linear Attention by introducing Selectivity. Instead of compressing all history equally (which blurs the memory), Mamba allows the model to dynamically gate the inputs—choosing to store relevant information and reset/forget irrelevant noise. This allows the Fast Weight approach to compete with the accuracy of explicit Attention while maintaining constant memory usage.
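
As a loose illustration of selectivity (a toy gating scheme, not Mamba’s actual parameterization), the decay and write strength can be made functions of the current input, so the state keeps what the model deems relevant and forgets the rest:

import numpy as np

def selective_step(S, x, k, v, w_forget, w_write):
    """One toy 'selective' fast-weight update: input-dependent decay and write gate."""
    f = 1.0 / (1.0 + np.exp(-(w_forget @ x)))   # how much of the old state to keep
    g = 1.0 / (1.0 + np.exp(-(w_write @ x)))    # how strongly to write this token
    return f * S + g * np.outer(k, v)

rng = np.random.default_rng(2)
d = 4
S = np.zeros((d, d))
x, k, v = rng.standard_normal((3, d))
w_forget, w_write = rng.standard_normal((2, d))
S = selective_step(S, x, k, v, w_forget, w_write)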

References Link to heading

  1. Hinton, G. E., & Plaut, D. C. (1987). “Using Fast Weights to Deblur Old Memories.” Proceedings of the 9th Annual Conference of the Cognitive Science Society.
  2. Ba, J., Hinton, G. E., et al. (2016). “Using Fast Weights to Attend to the Recent Past.” Advances in Neural Information Processing Systems (NeurIPS).
  3. Katharopoulos, A., et al. (2020). “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.” International Conference on Machine Learning (ICML).
  4. Gu, A., & Dao, T. (2023). “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” arXiv preprint arXiv:2312.00752.
  5. Vaswani, A., et al. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems (NeurIPS).
\ No newline at end of file diff --git a/posts/transformer-s-core-mechanics/index.html b/posts/transformer-s-core-mechanics/index.html index 5f303dc..ea940cf 100644 --- a/posts/transformer-s-core-mechanics/index.html +++ b/posts/transformer-s-core-mechanics/index.html @@ -36,4 +36,4 @@ In deep learning, a “channel” can be thought of as a feature dimensi 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/unifi-vlan-migration-to-zone-based-architecture/index.html b/posts/unifi-vlan-migration-to-zone-based-architecture/index.html index 8529598..9b69faa 100644 --- a/posts/unifi-vlan-migration-to-zone-based-architecture/index.html +++ b/posts/unifi-vlan-migration-to-zone-based-architecture/index.html @@ -28,4 +28,4 @@ This article documents that journey. It details the pitfalls encountered, the co 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/useful/index.html b/posts/useful/index.html index ba350fe..145a290 100644 --- a/posts/useful/index.html +++ b/posts/useful/index.html @@ -9,4 +9,4 @@ One-minute read

  • [6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file diff --git a/posts/vattention/index.html b/posts/vattention/index.html new file mode 100644 index 0000000..89c1203 --- /dev/null +++ b/posts/vattention/index.html @@ -0,0 +1,34 @@ +vAttention · Eric X. Liu's Personal Page

    vAttention

    Large Language Model (LLM) inference is memory-bound, primarily due to the Key-Value (KV) cache—a store of intermediate state that grows linearly with sequence length. Efficient management of this memory is critical for throughput. While PagedAttention (popularized by vLLM) became the industry standard by solving memory fragmentation via software, recent research suggests that leveraging the GPU’s native hardware Memory Management Unit (MMU) offers a more performant and portable solution.

    The Status Quo: PagedAttention and Software Tables Link to heading

    Prior to PagedAttention, systems allocated contiguous memory for the maximum possible context length, leading to severe fragmentation and wasted memory. PagedAttention addressed this by chunking the KV cache into non-contiguous blocks, managed by a software-defined “page table” (the Block Table) [1].
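
    To make the block-table indirection concrete, here is a toy NumPy version of the lookup a paged kernel must perform (block size, pool size, and shapes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
block_size, d = 4, 8                                # tokens per block, head dimension
kv_pool = rng.standard_normal((16, block_size, d))  # pool of physical KV blocks
block_table = np.array([7, 2, 11])                  # this request's logical -> physical mapping
seq_len = 10                                        # tokens currently in the sequence

# A paged kernel cannot index one contiguous K tensor; it must dereference the
# block table and gather scattered blocks before (or while) computing attention.
K = kv_pool[block_table].reshape(-1, d)[:seq_len]
print(K.shape)                                      # (10, 8): logically contiguous, physically scattered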

    While effective at reducing fragmentation, this approach introduces significant complexity:

    • Kernel Rewriting: Because the KV cache is no longer contiguous in virtual memory, standard attention kernels (like cuDNN SDPA or vanilla FlashAttention) cannot be used directly. Developers must rewrite kernels to manually dereference block tables [1].
    • Software Overhead: The system must manage virtual-to-physical mapping in user space, duplicating work typically handled by the OS. This adds runtime overhead to the critical path of both the CPU (managing tables) and the GPU (performing lookups) [1].
    • Performance Penalties: PagedAttention-based kernels have been observed to be slower than their non-paged counterparts. For example, vLLM’s paged kernel has been shown to be up to 2.8x slower than FlashAttention-2 in specific tests [1].

    The Hardware-Native Alternative: vAttention Link to heading

    vAttention proposes returning the responsibility of memory management to the OS and hardware. By utilizing the CUDA Virtual Memory Management (VMM) APIs, it is possible to decouple the allocation of virtual memory from physical memory [1].

    How it works:

    1. Virtual Contiguity: The system reserves a large, contiguous range of virtual addresses for the KV cache at request start.
    2. Physical Paging: Physical memory pages are allocated and mapped to this virtual range only on demand (dynamically) as the token sequence grows [1].
    3. Hardware Lookups: Because the GPU sees a contiguous virtual address range, the hardware Translation Lookaside Buffer (TLB) handles the address translation. This allows the use of unmodified, high-performance kernels like FlashAttention-2 or FlashAttention-3 without custom paging logic [1].
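
    A CPU-only toy of the reserve-then-map pattern sketched above; a real implementation performs step 2 with CUDA driver VMM calls (the cuMemMap family referenced in the paper), which this illustration deliberately avoids:

class VirtualKVCache:
    """Toy model: reserve a large virtual range once, commit fixed-size
    physical pages only as the sequence grows."""

    def __init__(self, reserved_bytes: int, page_bytes: int = 64 * 1024):
        self.reserved_bytes = reserved_bytes   # size of the virtual reservation
        self.page_bytes = page_bytes           # physical mapping granularity (e.g. 64KB)
        self.mapped_pages = 0                  # physical pages actually committed
        self.used_bytes = 0

    def append_token(self, kv_bytes_per_token: int) -> None:
        self.used_bytes += kv_bytes_per_token
        pages_needed = -(-self.used_bytes // self.page_bytes)   # ceiling division
        while self.mapped_pages < pages_needed:
            # Real system: map one more physical page into the reserved virtual range.
            self.mapped_pages += 1

cache = VirtualKVCache(reserved_bytes=2 << 30)      # reserve 2 GiB of virtual space up front
for _ in range(1000):                               # per-token KV footprint is an assumed figure
    cache.append_token(kv_bytes_per_token=16 * 1024)
print(cache.mapped_pages, "x 64KB pages mapped")    # grows with usage, not with the reservation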

    Technical Challenges and Solutions Link to heading

    Historically, using the GPU’s native virtual memory for high-frequency token generation faced two major bottlenecks: Control Plane Latency and Page Granularity.

    1. Control Plane Latency (The API Bottleneck): Standard memory allocation (cudaMalloc) is monolithic—it allocates virtual and physical memory simultaneously. The more granular driver API, cuMemMap, allows separating these steps but involves expensive round-trips to the OS driver. Invoking these APIs synchronously during decoding (which generates one token at a time) would stall the GPU execution pipeline [1].

    To solve this, vAttention utilizes execution overlap:

    • Because LLM decoding is autoregressive and predictable, the system knows exactly when new memory is needed (one token ahead).
    • The CPU initiates the memory mapping for the next token asynchronously while the GPU is still computing the current token. By the time the GPU reaches the next step, the TLB and page tables are already updated, effectively hiding the driver latency [1].
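
    The overlap itself is easy to sketch (a conceptual Python analogue; in the real system the “map” step is a driver call issued by the CPU while the GPU kernel for the current token runs):

from concurrent.futures import ThreadPoolExecutor
import time

def compute_token(step: int) -> None:
    time.sleep(0.010)        # stand-in for the GPU decode kernel for this step

def map_pages_for(step: int) -> None:
    time.sleep(0.002)        # stand-in for the driver call that maps pages for this step

with ThreadPoolExecutor(max_workers=1) as mapper:
    pending = mapper.submit(map_pages_for, 0)            # map memory for the first step
    for step in range(8):
        pending.result()                                 # pages for `step` are ready
        nxt = mapper.submit(map_pages_for, step + 1)     # start mapping for step+1 ...
        compute_token(step)                              # ... while this step computes
        pending = nxt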

    2. Page Size Granularity (The Fragmentation Bottleneck): The GPU TLB hierarchy is sensitive to page sizes.

    • 4KB Pages: Too small. Mapping gigabytes of KV cache with 4KB pages causes “TLB thrashing,” degrading performance.
    • 2MB Huge Pages: The standard for CUDA large allocations. However, allocating 2MB for a single token update causes massive internal fragmentation, negating the benefits of dynamic allocation.

    Research identified 64KB as the optimal page size, offering a balance between TLB efficiency and memory utilization. While standard CUDA APIs default to 2MB, vAttention utilizes modified driver calls to enable 64KB pages, eliminating TLB thrashing without incurring the fragmentation cost of huge pages [1].
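
    A rough worked example of why granularity matters (the per-request layout below is an assumption for illustration, not a figure from the paper): if a 32-layer model keeps K and V in separate per-layer buffers, each request has 64 independently growing mappings. With 2MB pages, each mapping can waste up to one partially filled page at its tail, i.e. up to 64 × 2MB = 128MB of internal fragmentation per request; with 64KB pages the same worst case is 64 × 64KB = 4MB.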

    Performance and Portability Implications Link to heading

    Moving memory management from software (PagedAttention) to hardware (vAttention) yields measurable benefits:

    • Throughput: In prefill-heavy workloads, vAttention outperforms PagedAttention-based systems (like vLLM and FlashInfer) by up to 1.23x due to the elimination of software lookup overheads. In decoding, it matches or exceeds the performance of optimized paged kernels [1].
    • Portability: A significant advantage is software compatibility. When FlashAttention-3 (optimized for NVIDIA Hopper H100 GPUs) was released, it did not initially support PagedAttention. vAttention enabled the immediate use of FlashAttention-3 with dynamic memory support, achieving up to 1.5x higher throughput than PagedAttention-based FlashAttention-2 [1].

    Conclusion Link to heading

    While PagedAttention solved the critical issue of KV cache fragmentation in LLM serving, it necessitated a complex software abstraction layer. By leveraging low-level CUDA VMM APIs, handling allocations asynchronously to hide driver latency, and optimizing page sizes, it is possible to achieve dynamic memory management using the GPU’s native hardware. This restores the illusion of contiguous memory, simplifies kernel development, and improves inference performance.

    References Link to heading

    [1] R. Prabhu et al., “vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ‘25), 2025.

    \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index fa7983c..db85673 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -1 +1 @@ -/2025-10-04T20:41:50+00:00weekly0.5/posts/2025-10-04T20:41:50+00:00weekly0.5/posts/benchmarking-llms-on-jetson-orin-nano/2025-10-04T20:41:50+00:00weekly0.5/posts/flashing-jetson-orin-nano-in-virtualized-environments/2025-10-02T08:42:39+00:00weekly0.5/posts/openwrt-mwan3-wireguard-endpoint-exclusion/2025-10-02T08:34:05+00:00weekly0.5/posts/unifi-vlan-migration-to-zone-based-architecture/2025-10-02T08:42:39+00:00weekly0.5/posts/quantization-in-llms/2025-08-20T06:02:35+00:00weekly0.5/posts/breville-barista-pro-maintenance/2025-08-20T06:04:36+00:00weekly0.5/posts/secure-boot-dkms-and-mok-on-proxmox-debian/2025-08-14T06:50:22+00:00weekly0.5/posts/how-rvq-teaches-llms-to-see-and-hear/2025-08-08T17:36:52+00:00weekly0.5/posts/supabase-deep-dive/2025-08-04T03:59:37+00:00weekly0.5/posts/ppo-for-language-models/2025-10-02T08:42:39+00:00weekly0.5/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/2025-08-03T06:02:48+00:00weekly0.5/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/2025-08-03T03:41:10+00:00weekly0.5/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/2025-08-03T04:20:20+00:00weekly0.5/posts/transformer-s-core-mechanics/2025-10-02T08:42:39+00:00weekly0.5/posts/useful/2025-08-03T08:37:28-07:00weekly0.5/about/2020-06-16T23:30:17-07:00weekly0.5/categories/weekly0.5/tags/weekly0.5 \ No newline at end of file +/2025-12-19T21:21:55+00:00weekly0.5/posts/2025-12-19T21:21:55+00:00weekly0.5/posts/the-convergence-of-fast-weights-linear-attention-and-state-space-models/2025-12-19T21:21:55+00:00weekly0.5/posts/vattention/2025-12-19T21:21:55+00:00weekly0.5/posts/benchmarking-llms-on-jetson-orin-nano/2025-10-04T20:41:50+00:00weekly0.5/posts/flashing-jetson-orin-nano-in-virtualized-environments/2025-10-02T08:42:39+00:00weekly0.5/posts/openwrt-mwan3-wireguard-endpoint-exclusion/2025-10-02T08:34:05+00:00weekly0.5/posts/unifi-vlan-migration-to-zone-based-architecture/2025-10-02T08:42:39+00:00weekly0.5/posts/quantization-in-llms/2025-08-20T06:02:35+00:00weekly0.5/posts/breville-barista-pro-maintenance/2025-08-20T06:04:36+00:00weekly0.5/posts/secure-boot-dkms-and-mok-on-proxmox-debian/2025-08-14T06:50:22+00:00weekly0.5/posts/how-rvq-teaches-llms-to-see-and-hear/2025-08-08T17:36:52+00:00weekly0.5/posts/supabase-deep-dive/2025-08-04T03:59:37+00:00weekly0.5/posts/ppo-for-language-models/2025-10-02T08:42:39+00:00weekly0.5/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/2025-08-03T06:02:48+00:00weekly0.5/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/2025-08-03T03:41:10+00:00weekly0.5/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/2025-08-03T04:20:20+00:00weekly0.5/posts/transformer-s-core-mechanics/2025-10-02T08:42:39+00:00weekly0.5/posts/useful/2025-08-03T08:37:28-07:00weekly0.5/about/2020-06-16T23:30:17-07:00weekly0.5/categories/weekly0.5/tags/weekly0.5 \ No newline at end of file diff --git a/tags/index.html b/tags/index.html index e76e718..b0911e1 100644 --- a/tags/index.html +++ b/tags/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[6ed1d69] \ No newline at end of file +[34aa99a] \ No newline at end of file