# The Convergence of Fast Weights, Linear Attention, and State Space Models

*Eric X. Liu · December 19, 2025 · 5-minute read*
Modern Large Language Models (LLMs) are dominated by the Transformer architecture. However, as context windows grow, the computational cost of the Transformer’s attention mechanism has become a primary bottleneck. Recent discussions in the AI community—most notably by Geoffrey Hinton—have highlighted a theoretical link between biological memory mechanisms (“Fast Weights”) and efficient engineering solutions like Linear Transformers and State Space Models (SSMs).

This article explores the mathematical equivalence between Hinton’s concept of Fast Weights as Associative Memory and the recurrence mechanisms found in models such as Mamba and RWKV.

## 1. The Standard Transformer Bottleneck
To understand the motivation for Fast Weights, one must first identify the inefficiency in standard Transformers. The core operation is **Self-Attention**, defined as:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V $$

During inference (generating tokens one by one), the model computes a Query ($Q$) for the current token and compares it against the Keys ($K$) and Values ($V$) of all previous tokens.

- **Computational Cost:** Quadratic $O(N^2)$ during training; linear $O(N)$ per step during inference.
- **Memory Cost:** The KV cache. To calculate the softmax, the model must explicitly store the $K$ and $V$ vectors for the entire history in GPU memory. For long contexts (e.g., 1 million tokens), this memory footprint becomes prohibitive.

The **softmax** function is the culprit. It introduces a non-linearity that binds $Q$ and $K$ together, preventing the mathematical separation of the current query from the historical context.
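To make the cache growth concrete, here is a minimal NumPy sketch of one autoregressive decode step against an explicit KV cache (the names, shapes, and toy dimensions are my own illustrative assumptions): every generated token appends to the cache, so memory grows with the sequence length.

```python
import numpy as np

def attention_decode_step(q, k_new, v_new, k_cache, v_cache):
    """One decode step of softmax attention with an explicit KV cache.

    q, k_new, v_new: (d,) vectors for the current token.
    k_cache, v_cache: lists of past keys/values (the growing buffer).
    """
    # The cache grows by one entry per generated token: O(N) memory.
    k_cache.append(k_new)
    v_cache.append(v_new)
    K = np.stack(k_cache)                          # (t, d)
    V = np.stack(v_cache)                          # (t, d)

    # The softmax ties q to every stored key, so the full history is needed.
    scores = K @ q / np.sqrt(q.shape[0])           # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                             # (d,) attended output

# Usage: memory held in k_cache/v_cache scales with the number of steps.
d, k_cache, v_cache = 16, [], []
rng = np.random.default_rng(0)
for _ in range(5):
    q, k, v = rng.normal(size=(3, d))
    y = attention_decode_step(q, k, v, k_cache, v_cache)
print(len(k_cache))  # 5 cached keys after 5 steps
```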
## 2. Fast Weights as Associative Memory

Geoffrey Hinton proposes that the brain does not maintain a “digital buffer” of past activations (like a KV cache). Instead, it relies on **Fast Weights**.

In this framework, neural connections possess two timescales:

1. **Slow Weights:** The standard parameters learned over long periods (training).
2. **Fast Weights:** Synaptic strengths that change rapidly during a forward pass to store temporary context.

Hinton formalizes this temporary storage as an **Associative Memory**. When a network encounters a new key-value pair ($k, v$), it does not store the vectors in a list. Instead, it updates a fast weight matrix $W_{fast}$ using the Hebbian learning rule (outer product):

$$ W_{fast} \leftarrow \lambda W_{fast} + (v \otimes k) $$

Here, $\lambda$ is a decay factor ($0 < \lambda < 1$) representing forgetfulness. This matrix $W_{fast}$ compresses the history into a fixed-size representation of size $d \times d$, regardless of the sequence length.
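A few lines of NumPy make the write/read cycle of this associative memory concrete (the dimension, decay value, and query normalization here are assumed purely for illustration): the entire history is folded into one $d \times d$ matrix, and probing it with a stored key retrieves an approximate value plus interference from the other stored pairs.

```python
import numpy as np

d, lam = 16, 0.95                  # dimension and decay factor (assumed values)
W_fast = np.zeros((d, d))          # fixed-size associative memory

rng = np.random.default_rng(1)
keys = rng.normal(size=(10, d))
values = rng.normal(size=(10, d))

# Write: Hebbian outer-product update, W <- lam * W + v k^T
for k, v in zip(keys, values):
    W_fast = lam * W_fast + np.outer(v, k)

# Read: probing with a stored key returns (approximately) its value,
# blurred by interference from the other stored associations.
query = keys[-1] / np.dot(keys[-1], keys[-1])   # scale so k · query ≈ 1
retrieved = W_fast @ query
print(np.corrcoef(retrieved, values[-1])[0, 1])  # correlates with the stored value
```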
## 3. Mathematical Unification: Linear Attention

The connection between Fast Weights and Transformers is established by removing the softmax function from the attention mechanism, a technique known as **Linear Attention**.

If we treat the interaction between $Q$ and $K$ as linear, the attention equation becomes:

$$ \text{LinearAttention} = (Q K^T) V $$

Using the associative property of matrix multiplication, we can reorder the operations:

$$ Q (K^T V) $$

This reordering fundamentally alters the mechanism:

- **Left side $(Q K^T) V$:** Compare the Query to all Keys, then multiply by the Values. Requires storing the history.
- **Right side $Q (K^T V)$:** Compute the summation of Key-Value outer products first.

The term $(K^T V)$ represents the summation of all past associations. This term **is** the Fast Weight matrix $W_{fast}$ described by Hinton:

$$ \text{State}_t = \sum_{i=1}^t k_i v_i^T $$

Thus, Linear Attention is effectively a system where the “state” is a matrix of Fast Weights that is updated at every time step.
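The equivalence is easy to verify numerically. The following sketch (illustrative sizes; causal masking assumed) computes causal linear attention both ways: the quadratic form $(QK^T)V$ over the full history, and the recurrent form that only carries the running $d \times d$ state.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 8, 4
Q, K, V = rng.normal(size=(3, N, d))

# Parallel ("Transformer-style") causal linear attention:
# y_t = q_t^T * sum_{i <= t} k_i v_i^T, computed via an N x N score matrix.
mask = np.tril(np.ones((N, N)))
y_parallel = (mask * (Q @ K.T)) @ V

# Recurrent ("Fast Weight") form: carry a d x d state instead of the history.
state = np.zeros((d, d))                 # this is W_fast / State_t
y_recurrent = np.zeros((N, d))
for t in range(N):
    state += np.outer(K[t], V[t])        # State_t = sum_i k_i v_i^T
    y_recurrent[t] = Q[t] @ state        # q_t^T State_t

print(np.allclose(y_parallel, y_recurrent))  # True: the two forms are identical
```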
## 4. State Space Models (SSMs) as Recurrent Fast Weights

State Space Models (like S4 and Mamba) typically define sequence modeling through continuous control theory, discretized into a recurrence:

$$ h_t = \bar{A} h_{t-1} + \bar{B} x_t $$

$$ y_t = \bar{C} h_t $$

While derived differently, this recurrence is mathematically equivalent to the Linear Attention/Fast Weight mechanism. We can demonstrate this by “unrolling” the SSM recursion to see how the output $y_t$ depends on the history.

The output at time $t$ is the sum of inputs weighted by decaying powers of $\bar{A}$:

$$ y_t = \sum_{j=1}^t \bar{C} (\bar{A}^{t-j}) (\bar{B} x_j) $$

Comparing this to the Linear Attention formulation with decay $\lambda$:

$$ \text{Attention}_t = q_t \sum_{j=1}^t (\lambda^{t-j}) (k_j^T v_j) $$

The mapping between architectures becomes clear:

- **Query ($q_t$)** $\leftrightarrow$ Output Matrix **$\bar{C}$**
- **Key/Value ($k_j^T v_j$)** $\leftrightarrow$ Input Matrix **$\bar{B} x_j$** (Input Projection)
- **Decay Factor ($\lambda$)** $\leftrightarrow$ State Matrix **$\bar{A}$**
- **Fast Weight Matrix ($S_t$)** $\leftrightarrow$ Hidden State **$h_t$**

Therefore, an SSM is mechanically a Transformer that uses Fast Weights (a fixed-size recurrent state) rather than a KV cache (a growing buffer) to handle attention.
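The unrolling argument can be checked the same way. The sketch below (illustrative shapes, with $\bar{A}$ chosen as a simple decaying matrix $0.9I$ purely for clarity) evaluates the SSM both as the $O(1)$-state recurrence and as the unrolled, attention-like sum over the history; the two agree exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d_in, d_state = 6, 4, 8
A_bar = 0.9 * np.eye(d_state)             # decaying state matrix (plays the role of lambda)
B_bar = rng.normal(size=(d_state, d_in))  # input projection (plays the role of the k/v write)
C_bar = rng.normal(size=(d_state,))       # output projection (plays the role of q_t)
x = rng.normal(size=(N, d_in))

# Recurrent form: h_t = A h_{t-1} + B x_t,  y_t = C h_t  -- O(1) state.
h = np.zeros(d_state)
y_recurrent = []
for t in range(N):
    h = A_bar @ h + B_bar @ x[t]
    y_recurrent.append(C_bar @ h)

# Unrolled form: y_t = sum_j C A^{t-j} B x_j  -- the decayed-attention view.
y_unrolled = [
    sum(C_bar @ np.linalg.matrix_power(A_bar, t - j) @ (B_bar @ x[j]) for j in range(t + 1))
    for t in range(N)
]

print(np.allclose(y_recurrent, y_unrolled))  # True: same computation, two readings
```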
## 5. Implications for Inference Optimization

This theoretical convergence has significant implications for inference efficiency.

### Standard Transformer
- **Mechanism:** Stores history in a KV cache.
- **Memory:** $O(N)$ (grows linearly with sequence length).
- **Performance:** High recall/precision because it retains the exact history.

### Fast Weight / SSM (Mamba / RWKV)
- **Mechanism:** Compresses history into a single matrix/vector state.
- **Memory:** $O(1)$ (constant memory, regardless of sequence length).
- **Performance:** Historically lower than Transformers due to “compression loss” (trying to stuff infinite history into a finite matrix).

**The Solution:** Modern SSMs like Mamba improve upon basic Linear Attention by introducing **Selectivity**. Instead of compressing *all* history equally (which blurs the memory), Mamba allows the model to dynamically gate the inputs—choosing to store relevant information and reset/forget irrelevant noise. This allows the Fast Weight approach to compete with the accuracy of explicit Attention while maintaining constant memory usage.
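To illustrate what selectivity buys, here is a toy sketch of an input-dependent gate on a fast-weight state. This is not Mamba’s actual parameterization (in Mamba the gating arises from learned, input-dependent parameters of the recurrence); it simply contrasts a data-dependent forget/write strength with the uniform decay $\lambda$ used earlier.

```python
import numpy as np

def selective_update(state, k, v, gate):
    """Input-dependent ("selective") fast-weight update.

    gate near 1 -> retain the old memory and write only weakly;
    gate near 0 -> reset/forget and overwrite with the new association.
    The gate here is a hand-rolled stand-in, not a learned quantity.
    """
    return gate * state + (1.0 - gate) * np.outer(v, k)

rng = np.random.default_rng(4)
d = 8
state = np.zeros((d, d))                         # constant-size memory
for _ in range(1000):                            # sequence length never grows the state
    k, v = rng.normal(size=(2, d))
    gate = 1.0 / (1.0 + np.exp(-np.dot(k, v)))   # toy "relevance" signal in (0, 1)
    state = selective_update(state, k, v, gate)
print(state.shape)                               # (8, 8) regardless of sequence length
```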
### References

1. Hinton, G. E., & Plaut, D. C. (1987). “Using Fast Weights to Deblur Old Memories.” *Proceedings of the 9th Annual Conference of the Cognitive Science Society.*
2. Ba, J., Hinton, G. E., et al. (2016). “Using Fast Weights to Attend to the Recent Past.” *Advances in Neural Information Processing Systems (NeurIPS).*
3. Katharopoulos, A., et al. (2020). “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.” *International Conference on Machine Learning (ICML).*
4. Gu, A., & Dao, T. (2023). “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” *arXiv preprint arXiv:2312.00752.*
5. Vaswani, A., et al. (2017). “Attention Is All You Need.” *Advances in Neural Information Processing Systems (NeurIPS).*