# The Convergence of Fast Weights, Linear Attention, and State Space Models

Eric X. Liu · December 19, 2025 · 5-minute read

Modern Large Language Models (LLMs) are dominated by the Transformer architecture. However, as context windows grow, the computational cost of the Transformer's attention mechanism has become a primary bottleneck. Recent discussions in the AI community, most notably by Geoffrey Hinton, have highlighted a theoretical link between biological memory mechanisms ("Fast Weights") and efficient engineering solutions like Linear Transformers and State Space Models (SSMs).

This article explores the mathematical equivalence between Hinton's concept of Fast Weights as Associative Memory and the recurrence mechanisms found in models such as Mamba and RWKV.
## 1. The Standard Transformer Bottleneck

To understand the motivation for Fast Weights, one must first identify the inefficiency in standard Transformers. The core operation is **Self-Attention**, defined as:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V $$

During inference (generating tokens one by one), the model computes a Query ($Q$) for the current token and compares it against the Keys ($K$) and Values ($V$) of all previous tokens.

- **Computational Cost:** Quadratic $O(N^2)$ during training; Linear $O(N)$ per step during inference.
- **Memory Cost:** The KV Cache. To calculate the softmax, the model must explicitly store the $K$ and $V$ vectors for the entire history in GPU memory. For long contexts (e.g., 1 million tokens), this memory footprint becomes prohibitive.

The **Softmax** function is the culprit. It introduces a non-linearity that binds $Q$ and $K$ together, preventing the mathematical separation of the current query from the historical context.
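To make the bottleneck concrete, here is a minimal NumPy sketch of one decoding step (the dimensions and random inputs are illustrative, not taken from any real model): the new token's key and value are appended to a cache, and the query is compared against every cached entry, so both memory and per-step work grow with the sequence length.

```python
import numpy as np

d = 64                      # head dimension (illustrative)
k_cache, v_cache = [], []   # the KV cache: one entry per past token

def attend_step(q, k, v):
    """One autoregressive decoding step with a growing KV cache."""
    k_cache.append(k)
    v_cache.append(v)
    K = np.stack(k_cache)                    # (t, d) -- grows every step
    V = np.stack(v_cache)                    # (t, d)
    scores = K @ q / np.sqrt(d)              # compare q against ALL past keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over the whole history
    return weights @ V                       # (d,)

rng = np.random.default_rng(0)
for t in range(1000):
    q, k, v = rng.normal(size=(3, d))
    out = attend_step(q, k, v)

print(len(k_cache))   # 1000: cache size scales with sequence length
```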
## 2. Fast Weights as Associative Memory

Geoffrey Hinton proposes that the brain does not maintain a "digital buffer" of past activations (like a KV cache). Instead, it relies on **Fast Weights**.

In this framework, neural connections possess two timescales:

1. **Slow Weights:** The standard parameters learned over long periods (training).
2. **Fast Weights:** Synaptic strengths that change rapidly during a forward pass to store temporary context.

Hinton formalizes this temporary storage as an **Associative Memory**. When a network encounters a new key-value pair ($k, v$), it does not store the vectors in a list. Instead, it updates a fast weight matrix $W_{fast}$ using the Hebbian learning rule (outer product):

$$ W_{fast} \leftarrow \lambda W_{fast} + (v \otimes k) $$

Here, $\lambda$ is a decay factor ($0 < \lambda < 1$) representing forgetfulness. This matrix $W_{fast}$ compresses the history into a fixed-size representation of size $d \times d$, regardless of the sequence length.
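A minimal sketch of this write-and-read cycle (NumPy; the dimension, decay value, and near-orthonormal keys are assumptions chosen so the retrieval is clean): each pair is written into a single $d \times d$ matrix by an outer product, and a later query reads back an approximation of the stored value with one matrix-vector multiplication.

```python
import numpy as np

d, lam = 64, 0.95                 # state size and decay factor (illustrative)
W_fast = np.zeros((d, d))         # fixed-size associative memory

def write(k, v):
    """Hebbian update: decay old associations, add the new outer product v ⊗ k."""
    global W_fast
    W_fast = lam * W_fast + np.outer(v, k)

def read(q):
    """Retrieve: W_fast @ q = sum_i lam^age_i * (k_i · q) * v_i."""
    return W_fast @ q

rng = np.random.default_rng(0)
keys = np.linalg.qr(rng.normal(size=(d, d)))[0]   # near-orthonormal keys
vals = rng.normal(size=(d, d))

for k, v in zip(keys, vals):
    write(k, v)

# Querying with the most recent key: the matching value dominates the read-out.
approx = read(keys[-1])
print(np.corrcoef(approx, vals[-1])[0, 1])        # close to 1
```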
## 3. Mathematical Unification: Linear Attention

The connection between Fast Weights and Transformers is established by removing the softmax function from the attention mechanism, a technique known as **Linear Attention**.

If we treat the interaction between $Q$ and $K$ as linear, the attention equation becomes:

$$ \text{LinearAttention} = (Q K^T) V $$

Using the associative property of matrix multiplication, we can reorder the operations:

$$ Q (K^T V) $$

This reordering fundamentally alters the mechanism:

- **Left Side $(Q K^T) V$:** Compare the Query to all Keys, then multiply by the Values. Requires storing the history.
- **Right Side $Q (K^T V)$:** Compute the summation of Key-Value outer products first.

The term $(K^T V)$ represents the summation of all past associations. This term **is** the Fast Weight matrix $W_{fast}$ described by Hinton.

$$ \text{State}_t = \sum_{i=1}^t k_i v_i^T $$

Thus, Linear Attention is effectively a system where the "state" is a matrix of Fast Weights that is updated at every time step.
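The reordering is easy to check numerically. The sketch below (NumPy, illustrative shapes) first verifies the unmasked associativity, then runs the causal version as a recurrence over a single $d \times d$ state, which is exactly the Fast Weight update without a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 128, 16
Q, K, V = rng.normal(size=(3, N, d))

# Associativity: (Q K^T) V == Q (K^T V), once the softmax is removed.
left  = (Q @ K.T) @ V        # O(N^2 d): compare each query with every key
right = Q @ (K.T @ V)        # O(N d^2): compress the history into a d x d matrix
assert np.allclose(left, right)

# Causal (autoregressive) form: maintain the Fast Weight state S_t = sum_i k_i v_i^T.
S = np.zeros((d, d))
outputs = []
for t in range(N):
    S += np.outer(K[t], V[t])      # write the new association
    outputs.append(Q[t] @ S)       # read with the current query
outputs = np.stack(outputs)

# Matches masked linear attention computed the quadratic way.
mask = np.tril(np.ones((N, N)))
assert np.allclose(outputs, (Q @ K.T * mask) @ V)
print("causal linear attention == fast weight recurrence")
```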
## 4. State Space Models (SSMs) as Recurrent Fast Weights

State Space Models (like S4 and Mamba) typically define sequence modeling through continuous control theory, discretized into a recurrence:

$$ h_t = \bar{A} h_{t-1} + \bar{B} x_t $$

$$ y_t = \bar{C} h_t $$

While derived differently, this recurrence is mathematically equivalent to the Linear Attention/Fast Weight mechanism. We can demonstrate this by "unrolling" the SSM recursion to see how the output $y_t$ depends on the history.

The output at time $t$ is the sum of inputs weighted by decaying powers of $\bar{A}$:

$$ y_t = \sum_{j=1}^t \bar{C} (\bar{A}^{t-j}) (\bar{B} x_j) $$

Comparing this to the Linear Attention formulation with decay $\lambda$:

$$ \text{Attention}_t = q_t \sum_{j=1}^t (\lambda^{t-j}) (k_j v_j^T) $$

The mapping between architectures becomes clear:

- **Query ($q_t$)** $\leftrightarrow$ Output Matrix **$\bar{C}$**
- **Key/Value ($k_j v_j^T$)** $\leftrightarrow$ Input Matrix **$\bar{B} x_j$** (Input Projection)
- **Decay Factor ($\lambda$)** $\leftrightarrow$ State Matrix **$\bar{A}$**
- **Fast Weight Matrix ($S_t$)** $\leftrightarrow$ Hidden State **$h_t$**

Therefore, an SSM is mechanically a Transformer that uses Fast Weights (a fixed-size recurrent state) rather than a KV Cache (a growing buffer) to handle attention.
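This equivalence can also be checked numerically. The sketch below uses a scalar $\bar{A}$ (i.e., a plain decay $\lambda$) and randomly chosen $\bar{B}$, $\bar{C}$ purely for illustration, not any specific SSM parameterization, and confirms that stepping the recurrence reproduces the unrolled sum of decayed inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_state = 64, 4, 16

A = 0.9                                   # scalar decay, i.e. A_bar = lambda * I
B = rng.normal(size=(d_state, d_in))      # input projection  (B_bar)
C = rng.normal(size=(d_state,))           # output projection (C_bar)
x = rng.normal(size=(N, d_in))

# Recurrent form: h_t = A h_{t-1} + B x_t,  y_t = C h_t
h = np.zeros(d_state)
y_rec = []
for t in range(N):
    h = A * h + B @ x[t]
    y_rec.append(C @ h)
y_rec = np.array(y_rec)

# Unrolled form: y_t = sum_{j<=t} C A^{t-j} (B x_j)
y_unrolled = np.array([
    sum((A ** (t - j)) * (C @ (B @ x[j])) for j in range(t + 1))
    for t in range(N)
])

assert np.allclose(y_rec, y_unrolled)
print("SSM recurrence == decayed sum over the input history")
```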
## 5. Implications for Inference Optimization

This theoretical convergence has significant implications for inference efficiency.
### Standard Transformer

- **Mechanism:** Stores history in a KV Cache.
- **Memory:** $O(N)$ (grows linearly with sequence length; see the back-of-envelope sketch below).
- **Performance:** High recall/precision because it retains the exact history.
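A back-of-envelope sketch of what $O(N)$ means in practice. The layer count, head configuration, and precision below are assumptions picked for round numbers, not any particular model's configuration.

```python
# Back-of-envelope KV-cache size (illustrative dimensions, not a real model).
layers     = 32
kv_heads   = 8
head_dim   = 128
bytes_fp16 = 2
seq_len    = 1_000_000

# A K and a V vector are cached for every layer, head, and token.
kv_cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16
print(f"KV cache at 1M tokens: {kv_cache_bytes / 1e9:.0f} GB")   # ~131 GB

# A Fast-Weight / SSM state is a fixed-size matrix per layer, independent of seq_len
# (a d x d state here, purely for comparison; real recurrent states are often smaller).
d_model = kv_heads * head_dim
state_bytes = layers * d_model * d_model * bytes_fp16
print(f"Fixed recurrent state: {state_bytes / 1e6:.0f} MB")      # ~67 MB
```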
### Fast Weight / SSM (Mamba / RWKV)

- **Mechanism:** Compresses history into a single matrix/vector state.
- **Memory:** $O(1)$ (constant memory, regardless of sequence length).
- **Performance:** Historically lower recall/accuracy than Transformers due to "compression loss" (trying to squeeze an unbounded history into a finite matrix).

**The Solution:** Modern SSMs like Mamba improve upon basic Linear Attention by introducing **Selectivity**. Instead of compressing *all* history equally (which blurs the memory), Mamba allows the model to dynamically gate the inputs, choosing to store relevant information and reset/forget irrelevant noise. This allows the Fast Weight approach to compete with the accuracy of explicit Attention while maintaining constant memory usage.
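As a toy illustration of the selectivity idea (this is not Mamba's actual parameterization; the scalar gate and its parameters below are made up for the example), the write into the fast-weight state is modulated by the current input, so irrelevant tokens barely disturb the memory while salient ones can overwrite it:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
W_gate = rng.normal(size=d)          # hypothetical gate parameters
S = np.zeros((d, d))                 # fixed-size recurrent state

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_step(x, k, v):
    """Input-dependent gating: the gate decides how much to forget and how much to write."""
    global S
    g = sigmoid(W_gate @ x)                   # g ~ 0: keep old memory, ignore this input
    S = (1.0 - g) * S + g * np.outer(v, k)    # g ~ 1: reset toward the new association
    return S

for _ in range(100):
    x = rng.normal(size=d)
    k, v = rng.normal(size=(2, d))
    selective_step(x, k, v)

print(S.shape)   # (32, 32): constant memory no matter how long the sequence runs
```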
### References

1. **Hinton, G. E., & Plaut, D. C. (1987).** "Using Fast Weights to Deblur Old Memories." *Proceedings of the 9th Annual Conference of the Cognitive Science Society.*
2. **Ba, J., Hinton, G. E., et al. (2016).** "Using Fast Weights to Attend to the Recent Past." *Advances in Neural Information Processing Systems (NeurIPS).*
3. **Katharopoulos, A., et al. (2020).** "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." *International Conference on Machine Learning (ICML).*
4. **Gu, A., & Dao, T. (2023).** "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." *arXiv preprint arXiv:2312.00752.*
5. **Vaswani, A., et al. (2017).** "Attention Is All You Need." *Advances in Neural Information Processing Systems (NeurIPS).*