📚 Auto-publish: Add/update 2 blog posts
Generated on: Fri Dec 19 21:21:55 UTC 2025 Source: md-personal repository
---
title: "The Convergence of Fast Weights, Linear Attention, and State Space Models"
date: 2025-12-19
draft: false
---

Modern Large Language Models (LLMs) are dominated by the Transformer architecture. However, as context windows grow, the computational cost of the Transformer’s attention mechanism has become a primary bottleneck. Recent discussions in the AI community—most notably by Geoffrey Hinton—have highlighted a theoretical link between biological memory mechanisms ("Fast Weights") and efficient engineering solutions like Linear Transformers and State Space Models (SSMs).

This article explores the mathematical equivalence between Hinton’s concept of Fast Weights as Associative Memory and the recurrence mechanisms found in models such as Mamba and RWKV.

## 1. The Standard Transformer Bottleneck

To understand the motivation for Fast Weights, one must first identify the inefficiency in standard Transformers. The core operation is **Self-Attention**, defined as:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V $$

During inference (generating tokens one by one), the model computes a Query ($Q$) for the current token and compares it against the Keys ($K$) and Values ($V$) of all previous tokens.

* **Computational Cost:** Quadratic $O(N^2)$ during training; Linear $O(N)$ per step during inference.
* **Memory Cost:** The KV Cache. To calculate the softmax, the model must explicitly store the $K$ and $V$ vectors for the entire history in GPU memory. For long contexts (e.g., 1 million tokens), this memory footprint becomes prohibitive.

The **Softmax** function is the culprit. It introduces a non-linearity that binds $Q$ and $K$ together, preventing the mathematical separation of the current query from the historical context.
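
To make the bottleneck concrete, here is a minimal NumPy sketch (my own illustration, not from the original post) of a single decode step: the cache gains one row per token, and every step re-reads the entire history to compute the softmax.

```python
# Illustrative only: a single-head softmax-attention decode step in NumPy,
# showing why the K and V history (the KV cache) must be kept around.
import numpy as np

def decode_step(q, K_cache, V_cache, k_new, v_new):
    """Append the new token's key/value, then attend over the full history."""
    K_cache = np.vstack([K_cache, k_new])        # cache grows by one row per token -> O(N) memory
    V_cache = np.vstack([V_cache, v_new])
    scores = K_cache @ q / np.sqrt(q.shape[0])   # compare q against every stored key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # the softmax couples q with all past keys at once
    out = weights @ V_cache
    return out, K_cache, V_cache

d = 16
rng = np.random.default_rng(0)
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for t in range(5):                               # each step touches the entire history
    q, k, v = rng.normal(size=(3, d))
    out, K_cache, V_cache = decode_step(q, K_cache, V_cache, k, v)
print(K_cache.shape)  # (5, 16): the cache keeps every past key
```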
## 2. Fast Weights as Associative Memory

Geoffrey Hinton proposes that the brain does not maintain a "digital buffer" of past activations (like a KV cache). Instead, it relies on **Fast Weights**.

In this framework, neural connections possess two timescales:

1. **Slow Weights:** The standard parameters learned over long periods (training).
2. **Fast Weights:** Synaptic strengths that change rapidly during a forward pass to store temporary context.

Hinton formalizes this temporary storage as an **Associative Memory**. When a network encounters a new key-value pair ($k, v$), it does not store the vectors in a list. Instead, it updates a fast weight matrix $W_{fast}$ using the Hebbian learning rule (outer product):

$$ W_{fast} \leftarrow \lambda W_{fast} + (v \otimes k) $$

Here, $\lambda$ is a decay factor ($0 < \lambda < 1$) representing forgetfulness. This matrix $W_{fast}$ compresses the history into a fixed-size representation of size $d \times d$, regardless of the sequence length.
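
A minimal sketch of this write/read cycle, assuming the matrix-multiply retrieval used in the fast-weights literature (the code and variable names here are illustrative):

```python
# Sketch of a Hebbian fast-weight memory: write with an outer product, read by
# multiplying a query against the accumulated matrix.
import numpy as np

d, lam = 16, 0.95          # lam is the decay factor (forgetfulness)
W_fast = np.zeros((d, d))  # fixed-size memory, independent of sequence length

def write(W_fast, k, v):
    return lam * W_fast + np.outer(v, k)   # W_fast <- lam * W_fast + v (x) k

def read(W_fast, q):
    return W_fast @ q                      # retrieve a blend of stored values

rng = np.random.default_rng(0)
k1, v1 = rng.normal(size=(2, d))
W_fast = write(W_fast, k1, v1)
recalled = read(W_fast, k1)                # querying with k1 returns v1 scaled by ||k1||^2
print(np.allclose(recalled / (k1 @ k1), v1))  # True here; interference appears once many pairs are stored
```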
## 3. Mathematical Unification: Linear Attention

The connection between Fast Weights and Transformers is established by removing the softmax function from the attention mechanism, a technique known as **Linear Attention**.

If we treat the interaction between $Q$ and $K$ as linear, the attention equation becomes:

$$ \text{LinearAttention} = (Q K^T) V $$

Using the associative property of matrix multiplication, we can reorder the operations:

$$ Q (K^T V) $$

This reordering fundamentally alters the mechanism:

* **Left Side $(Q K^T) V$:** Compare the Query to all Keys, then multiply by the Values. Requires storing the history.
* **Right Side $Q (K^T V)$:** Compute the summation of Key-Value outer products first.

The term $(K^T V)$ represents the summation of all past associations. This term **is** the Fast Weight matrix $W_{fast}$ described by Hinton.

$$ \text{State}_t = \sum_{i=1}^t k_i v_i^T $$

Thus, Linear Attention is effectively a system where the "state" is a matrix of Fast Weights that is updated at every time step.
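
This equivalence is easy to check numerically. The sketch below (hand-written, not from the post) computes causal linear attention twice: once in the attention form with a causal mask, once as a running $d \times d$ fast-weight state.

```python
# Numerical check (illustrative): causal linear attention computed two ways gives
# the same result -- the "attention" view and the "fast weight state" view.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8
Q, K, V = rng.normal(size=(3, T, d))

# View 1: attention form, (Q K^T) V with a causal mask (history kept explicitly).
scores = np.tril(Q @ K.T)            # token t may only look at tokens <= t
out_attention = scores @ V

# View 2: fast-weight form, a d x d state updated once per token (no history kept).
S = np.zeros((d, d))                 # State_t = sum_i k_i v_i^T
out_state = np.zeros((T, d))
for t in range(T):
    S = S + np.outer(K[t], V[t])     # Hebbian update with the new (k, v) pair
    out_state[t] = Q[t] @ S          # readout: q_t (K^T V)

print(np.allclose(out_attention, out_state))  # True
```

The first form needs the full $K$ and $V$ history; the second only ever touches the fixed-size state.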
## 4. State Space Models (SSMs) as Recurrent Fast Weights

State Space Models (like S4 and Mamba) typically define sequence modeling through continuous control theory, discretized into a recurrence:

$$ h_t = \bar{A} h_{t-1} + \bar{B} x_t $$
$$ y_t = \bar{C} h_t $$

While derived differently, this recurrence is mathematically equivalent to the Linear Attention/Fast Weight mechanism. We can demonstrate this by "unrolling" the SSM recursion to see how the output $y_t$ depends on the history.

The output at time $t$ is the sum of inputs weighted by decaying powers of $\bar{A}$:

$$ y_t = \sum_{j=1}^t \bar{C} (\bar{A}^{t-j}) (\bar{B} x_j) $$

Comparing this to the Linear Attention formulation with decay $\lambda$:

$$ \text{Attention}_t = q_t \sum_{j=1}^t (\lambda^{t-j}) (k_j^T v_j) $$

The mapping between architectures becomes clear:

* **Query ($q_t$)** $\leftrightarrow$ Output Matrix **$\bar{C}$**
* **Key/Value ($k_j^T v_j$)** $\leftrightarrow$ Input Matrix **$\bar{B} x_j$** (Input Projection)
* **Decay Factor ($\lambda$)** $\leftrightarrow$ State Matrix **$\bar{A}$**
* **Fast Weight Matrix ($S_t$)** $\leftrightarrow$ Hidden State **$h_t$**

Therefore, an SSM is mechanically a Transformer that uses Fast Weights (a fixed-size recurrent state) rather than a KV Cache (a growing buffer) to handle attention.
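
The same check works with decay. In the sketch below (a simplification in which a scalar $\lambda$ stands in for $\bar{A}$), the recurrent state update and the unrolled sum produce identical outputs:

```python
# Sketch: the decayed fast-weight recurrence is exactly an SSM-style scan, with a
# scalar decay playing the role of A-bar. Hand-written illustration.
import numpy as np

rng = np.random.default_rng(1)
T, d, lam = 6, 8, 0.9
Q, K, V = rng.normal(size=(3, T, d))

# Recurrent form: S_t = lam * S_{t-1} + k_t v_t^T,  y_t = q_t S_t   (h_t <-> S_t, A-bar <-> lam)
S = np.zeros((d, d))
y_recurrent = np.zeros((T, d))
for t in range(T):
    S = lam * S + np.outer(K[t], V[t])
    y_recurrent[t] = Q[t] @ S

# Unrolled form: y_t = q_t * sum_j lam^(t-j) k_j v_j^T  (the convolutional view of the same model)
y_unrolled = np.zeros((T, d))
for t in range(T):
    S_sum = sum(lam ** (t - j) * np.outer(K[j], V[j]) for j in range(t + 1))
    y_unrolled[t] = Q[t] @ S_sum

print(np.allclose(y_recurrent, y_unrolled))  # True
```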
## 5. Implications for Inference Optimization

This theoretical convergence has significant implications for inference efficiency.

### Standard Transformer

* **Mechanism:** Stores history in a KV Cache.
* **Memory:** $O(N)$ (grows linearly with sequence length).
* **Performance:** High recall/precision because it retains the exact history.

### Fast Weight / SSM (Mamba / RWKV)

* **Mechanism:** Compresses history into a single matrix/vector state.
* **Memory:** $O(1)$ (constant memory, regardless of sequence length).
* **Performance:** Historically lower than Transformers due to "compression loss" (trying to stuff infinite history into a finite matrix).

**The Solution:** Modern SSMs like Mamba improve upon basic Linear Attention by introducing **Selectivity**. Instead of compressing *all* history equally (which blurs the memory), Mamba allows the model to dynamically gate the inputs—choosing to store relevant information and reset/forget irrelevant noise. This allows the Fast Weight approach to compete with the accuracy of explicit Attention while maintaining constant memory usage.
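
A highly simplified sketch of what selectivity means in the fast-weight picture (this illustrates input-dependent gating only; it is not Mamba's actual parameterization, which uses a structured state matrix and learned discretization):

```python
# Sketch of "selectivity": the decay applied to the fast-weight state is a function of
# the current input, so the model can choose what to keep and what to forget.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
T, d = 6, 8
X = rng.normal(size=(T, d))

# Hypothetical learned projections (random here, only to give the right shapes):
W_k, W_v, W_q = rng.normal(size=(3, d, d)) / np.sqrt(d)
w_forget = rng.normal(size=d) / np.sqrt(d)   # produces a per-step forget gate from x_t

S = np.zeros((d, d))
outputs = np.zeros((T, d))
for t, x in enumerate(X):
    k, v, q = W_k @ x, W_v @ x, W_q @ x
    lam_t = sigmoid(w_forget @ x)            # input-dependent decay: ~1 keeps memory, ~0 resets it
    S = lam_t * S + np.outer(k, v)           # gated fast-weight / selective state update
    outputs[t] = q @ S
print(outputs.shape)  # (6, 8): constant-size state, input-dependent forgetting
```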
### References

1. **Hinton, G. E., & Plaut, D. C. (1987).** "Using Fast Weights to Deblur Old Memories." *Proceedings of the 9th Annual Conference of the Cognitive Science Society.*
2. **Ba, J., Hinton, G. E., et al. (2016).** "Using Fast Weights to Attend to the Recent Past." *Advances in Neural Information Processing Systems (NeurIPS).*
3. **Katharopoulos, A., et al. (2020).** "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." *International Conference on Machine Learning (ICML).*
4. **Gu, A., & Dao, T. (2023).** "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." *arXiv preprint arXiv:2312.00752.*
5. **Vaswani, A., et al. (2017).** "Attention Is All You Need." *Advances in Neural Information Processing Systems (NeurIPS).*
content/posts/vattention.md (new file)
---
title: "vAttention"
date: 2025-12-08
draft: false
---

Large Language Model (LLM) inference is memory-bound, primarily due to the Key-Value (KV) cache—a store of intermediate state that grows linearly with sequence length. Efficient management of this memory is critical for throughput. While **PagedAttention** (popularized by vLLM) became the industry standard by solving memory fragmentation via software, recent research suggests that leveraging the GPU’s native hardware Memory Management Unit (MMU) offers a more performant and portable solution.

#### The Status Quo: PagedAttention and Software Tables

Prior to PagedAttention, systems allocated contiguous memory for the maximum possible context length, leading to severe fragmentation and wasted memory. PagedAttention addressed this by chunking the KV cache into non-contiguous blocks, managed by a software-defined "page table" (the Block Table) [1].
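
As a rough illustration (hypothetical code, not vLLM's actual data structures), the block table maps logical token positions to arbitrary physical KV blocks, and every attention lookup has to go through it:

```python
# Conceptual sketch of PagedAttention-style software paging: logical token positions
# are translated to physical KV blocks via a per-request block table.
BLOCK_SIZE = 16  # tokens per KV block (illustrative value)

class BlockTable:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks
        self.blocks = []                       # logical block index -> physical block id

    def slot_for(self, token_pos):
        """Return (physical_block, offset) for a token; grow the table on demand."""
        logical_block, offset = divmod(token_pos, BLOCK_SIZE)
        while len(self.blocks) <= logical_block:
            self.blocks.append(self.free_blocks.pop())   # grab any free physical block
        return self.blocks[logical_block], offset

free_blocks = list(range(1000))                # pool of physical KV blocks on the GPU
table = BlockTable(free_blocks)
print(table.slot_for(0))    # (999, 0): the first token lands in an arbitrary free block
print(table.slot_for(37))   # physical block backing logical block 2, offset 5
```

This indirection is exactly what forces the kernel-level changes described next.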
While effective at reducing fragmentation, this approach introduces significant complexity:

* **Kernel Rewriting:** Because the KV cache is no longer contiguous in virtual memory, standard attention kernels (like cuDNN SDPA or vanilla FlashAttention) cannot be used directly. Developers must rewrite kernels to manually dereference block tables [1].
* **Software Overhead:** The system must manage virtual-to-physical mapping in user space, duplicating work typically handled by the OS. This adds runtime overhead to the critical path of both the CPU (managing tables) and the GPU (performing lookups) [1].
* **Performance Penalties:** PagedAttention-based kernels have been observed to be slower than their non-paged counterparts. For example, vLLM's paged kernel has been shown to be up to 2.8x slower than FlashAttention-2 in specific tests [1].

#### The Hardware-Native Alternative: vAttention
**vAttention** proposes returning the responsibility of memory management to the OS and hardware. By utilizing the CUDA Virtual Memory Management (VMM) APIs, it is possible to decouple the allocation of virtual memory from physical memory [1].

**How it works:**

1. **Virtual Contiguity:** The system reserves a large, contiguous range of virtual addresses for the KV cache at request start.
2. **Physical Paging:** Physical memory pages are allocated and mapped to this virtual range only on demand (dynamically) as the token sequence grows [1].
3. **Hardware Lookups:** Because the GPU sees a contiguous virtual address range, the hardware Translation Lookaside Buffer (TLB) handles the address translation. This allows the use of unmodified, high-performance kernels like FlashAttention-2 or FlashAttention-3 without custom paging logic [1].
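
The reserve-then-map-on-demand pattern from the list above can be sketched as follows. This is a simplified Python model with invented constants and helper names; the real system performs these steps with CUDA driver VMM calls such as `cuMemAddressReserve`, `cuMemCreate`, and `cuMemMap` [1].

```python
# Conceptual model of "reserve virtual space up front, back it with physical pages on demand".
# A plain dict stands in for the GPU page table so the idea is runnable anywhere.
PAGE_SIZE = 64 * 1024          # bytes; the page granularity the paper favors
VIRTUAL_RESERVATION = 1 << 30  # example: reserve 1 GiB of virtual KV-cache space per request

class VirtualKVCache:
    def __init__(self):
        self.mapped = {}         # page index -> "physical page" (simulated)
        self.committed_bytes = 0

    def ensure_capacity(self, needed_bytes):
        """Map just enough physical pages to back `needed_bytes` of the reservation."""
        assert needed_bytes <= VIRTUAL_RESERVATION
        while self.committed_bytes < needed_bytes:
            page_index = self.committed_bytes // PAGE_SIZE
            self.mapped[page_index] = bytearray(PAGE_SIZE)   # stand-in for cuMemCreate + cuMemMap
            self.committed_bytes += PAGE_SIZE

kv = VirtualKVCache()
bytes_per_token = 4096           # assumed per-token KV footprint, for illustration only
for t in range(1, 100):          # decode loop: new pages appear only at page boundaries
    kv.ensure_capacity(t * bytes_per_token)
print(len(kv.mapped), "pages mapped for", 99 * bytes_per_token, "bytes")
```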
#### Technical Challenges and Solutions

Historically, using the GPU's native virtual memory for high-frequency token generation faced two major bottlenecks: **Control Plane Latency** and **Page Granularity**.

**1. Control Plane Latency (The API Bottleneck)**

Standard memory allocation (`cudaMalloc`) is monolithic—it allocates virtual and physical memory simultaneously. The more granular driver API, `cuMemMap`, allows separating these steps but involves expensive round-trips to the OS driver. Invoking these APIs synchronously during decoding (which generates one token at a time) would stall the GPU execution pipeline [1].

To solve this, vAttention utilizes **execution overlap**:

* Because LLM decoding is autoregressive and predictable, the system knows exactly when new memory is needed (one token ahead).
* The CPU initiates the memory mapping for the *next* token asynchronously while the GPU is still computing the *current* token. By the time the GPU reaches the next step, the TLB and page tables are already updated, effectively hiding the driver latency [1] (sketched below).
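
A toy simulation of the overlap (plain Python sleeps standing in for GPU compute and driver calls; the timings are invented):

```python
# Toy illustration of hiding allocation latency behind compute: while "the GPU" works on
# step t, a background thread prepares the memory needed for step t+1.
import threading
import time

def gpu_decode_step(t):
    time.sleep(0.010)            # pretend the forward pass for token t takes 10 ms

def map_page_for(t):
    time.sleep(0.003)            # pretend the driver round-trip to map a page takes 3 ms

start = time.perf_counter()
for t in range(20):
    prefetch = threading.Thread(target=map_page_for, args=(t + 1,))
    prefetch.start()             # kick off mapping for the NEXT token on the CPU...
    gpu_decode_step(t)           # ...while the current token is being computed
    prefetch.join()              # by now the mapping has almost always already finished
elapsed = time.perf_counter() - start
print(f"{elapsed:.3f}s total: close to 20 x 10 ms, so the 3 ms mapping cost is hidden")
```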
**2. Page Size Granularity (The Fragmentation Bottleneck)**

The GPU TLB hierarchy is sensitive to page sizes.

* **4KB Pages:** Too small. Mapping gigabytes of KV cache with 4KB pages causes "TLB thrashing," degrading performance.
* **2MB Huge Pages:** The standard for CUDA large allocations. However, allocating 2MB for a single token update causes massive internal fragmentation, negating the benefits of dynamic allocation.

Research identified **64KB** as the optimal page size, offering a balance between TLB efficiency and memory utilization. While standard CUDA APIs default to 2MB, vAttention utilizes modified driver calls to enable 64KB pages, eliminating TLB thrashing without incurring the fragmentation cost of huge pages [1].
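
Some rough arithmetic illustrates the trade-off; the per-token KV size and token count below are assumed example values, not figures from the paper:

```python
# Back-of-the-envelope comparison of the page-size trade-off: smaller pages mean more
# mappings (TLB pressure), larger pages mean more worst-case internal fragmentation.
PAGE_SIZES = {"4KB": 4 * 1024, "64KB": 64 * 1024, "2MB": 2 * 1024 * 1024}

kv_bytes_per_token = 8 * 1024      # assumption: 8 KiB of K+V per token
tokens_in_flight = 10_000          # assumption: total tokens across the active batch

total_kv = kv_bytes_per_token * tokens_in_flight
for name, page in PAGE_SIZES.items():
    pages_needed = -(-total_kv // page)    # ceiling division
    worst_case_waste = page                # at most one partially filled page per growing request
    print(f"{name:>5}: {pages_needed:>6} pages to map, "
          f"up to {worst_case_waste // 1024} KiB wasted per growing request")
```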
#### Performance and Portability Implications

Moving memory management from software (PagedAttention) to hardware (vAttention) yields measurable benefits:

* **Throughput:** In prefill-heavy workloads, vAttention outperforms PagedAttention-based systems (like vLLM and FlashInfer) by up to 1.23x due to the elimination of software lookup overheads. In decoding, it matches or exceeds the performance of optimized paged kernels [1].
* **Portability:** A significant advantage is software compatibility. When FlashAttention-3 (optimized for NVIDIA Hopper H100 GPUs) was released, it did not initially support PagedAttention. vAttention enabled the immediate use of FlashAttention-3 with dynamic memory support, achieving up to 1.5x higher throughput than PagedAttention-based FlashAttention-2 [1].

#### Conclusion

While PagedAttention solved the critical issue of memory fragmentation in LLM serving, it necessitated a complex software abstraction layer. By leveraging low-level CUDA VMM APIs, handling allocations asynchronously to hide driver latency, and optimizing page sizes, it is possible to achieve dynamic memory management using the GPU's native hardware. This restores the illusion of contiguous memory, simplifies kernel development, and improves inference performance.

### References

[1] R. Prabhu et al., "vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention," in *Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '25)*, 2025.