<label class="menu-button float-right" for=menu-toggle><i class="fa-solid fa-bars fa-fw" aria-hidden=true></i></label><ul class=navigation-list><li class=navigation-item><a class=navigation-link href=/posts/>Posts</a></li><li class=navigation-item><a class=navigation-link href=https://chat.ericxliu.me>Chat</a></li><li class=navigation-item><a class=navigation-link href=https://git.ericxliu.me/user/oauth2/Authenitk>Git</a></li><li class=navigation-item><a class=navigation-link href=https://coder.ericxliu.me/api/v2/users/oidc/callback>Coder</a></li><li class=navigation-item><a class=navigation-link href=/>|</a></li><li class=navigation-item><a class=navigation-link href=https://sso.ericxliu.me>Sign in</a></li></ul></section></nav><div class=content><section class="container post"><article><header><div class=post-title><h1 class=title><a class=title-link href=/posts/vattention/>vAttention</a></h1></div><div class=post-meta><div class=date><span class=posted-on><i class="fa-solid fa-calendar" aria-hidden=true></i>
|
||
<time datetime=2025-12-08T00:00:00Z>December 8, 2025
|
||
</time></span><span class=reading-time><i class="fa-solid fa-clock" aria-hidden=true></i>
|
||
Large Language Model (LLM) inference is memory-bound, primarily due to the Key-Value (KV) cache—a store of intermediate state that grows linearly with sequence length. Efficient management of this memory is critical for throughput. While **PagedAttention** (popularized by vLLM) became the industry standard by solving memory fragmentation via software, recent research suggests that leveraging the GPU’s native hardware Memory Management Unit (MMU) offers a more performant and portable solution.

#### The Status Quo: PagedAttention and Software Tables
Prior to PagedAttention, systems allocated contiguous memory for the maximum possible context length, leading to severe fragmentation and wasted memory. PagedAttention addressed this by chunking the KV cache into non-contiguous blocks, managed by a software-defined “page table” (the Block Table) [1].

While effective at reducing fragmentation, this approach introduces significant complexity:

- **Kernel Rewriting:** Because the KV cache is no longer contiguous in virtual memory, standard attention kernels (like cuDNN SDPA or vanilla FlashAttention) cannot be used directly. Developers must rewrite kernels to manually dereference block tables (a minimal sketch of this indirection follows the list) [1].
- **Software Overhead:** The system must manage virtual-to-physical mappings in user space, duplicating work typically handled by the OS. This adds runtime overhead to the critical path of both the CPU (managing tables) and the GPU (performing lookups) [1].
- **Performance Penalties:** PagedAttention-based kernels have been observed to be slower than their non-paged counterparts; for example, vLLM’s paged kernel has been measured at up to 2.8x slower than FlashAttention-2 in specific tests [1].
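To make the kernel-rewriting burden concrete, here is a minimal, hypothetical sketch (not vLLM’s actual kernel code; the per-head layout and parameter names are assumptions) contrasting the address computation in a contiguous kernel with the block-table indirection a paged kernel must perform on every access:

```cpp
#include <cuda_fp16.h>

// Contiguous KV cache: plain pointer arithmetic, no extra memory traffic.
__device__ const half* kv_lookup_contiguous(const half* k_cache,
                                            int token_idx, int head_dim) {
    return k_cache + (size_t)token_idx * head_dim;
}

// Paged KV cache: an extra dependent load from the software block table
// is required before the actual K/V data can be fetched.
__device__ const half* kv_lookup_paged(const half* kv_pool,
                                       const int* block_table,  // logical block -> physical block id
                                       int token_idx, int block_size,
                                       int head_dim) {
    int physical_block = block_table[token_idx / block_size];
    size_t offset = ((size_t)physical_block * block_size
                     + (size_t)(token_idx % block_size)) * head_dim;
    return kv_pool + offset;
}
```

That extra dependent load, repeated across every attention read, is precisely the translation work the hardware TLB performs transparently in the approach described next.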
#### The Hardware-Native Alternative: vAttention

**vAttention** proposes returning the responsibility of memory management to the OS and hardware. By utilizing the CUDA Virtual Memory Management (VMM) APIs, it is possible to decouple the allocation of virtual memory from physical memory [1].

**How it works:**

1. **Virtual Contiguity:** The system reserves a large, contiguous range of virtual addresses for the KV cache at request start.
2. **Physical Paging:** Physical memory pages are allocated and mapped into this virtual range only on demand (dynamically) as the token sequence grows (see the sketch after this list) [1].
3. **Hardware Lookups:** Because the GPU sees a contiguous virtual address range, the hardware Translation Lookaside Buffer (TLB) handles the address translation. This allows the use of unmodified, high-performance kernels like FlashAttention-2 or FlashAttention-3 without custom paging logic [1].
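Step 2 above is essentially the reserve-then-map pattern exposed by the CUDA driver’s VMM API. The following is a simplified sketch under stated assumptions, not vAttention’s actual allocator: the `KVBuffer` struct and function names are hypothetical, error handling is omitted, and the stock driver typically reports a 2 MB minimum granularity here (the 64 KB pages discussed later require modified driver support).

```cpp
#include <cuda.h>
#include <vector>

struct KVBuffer {
    CUdeviceptr base = 0;   // contiguous *virtual* range for one K or V tensor
    size_t reserved = 0;    // virtual bytes reserved up front (max context)
    size_t mapped = 0;      // physical bytes actually backed so far
    size_t page = 0;        // physical allocation granularity
    std::vector<CUmemGenericAllocationHandle> handles;
    int device = 0;
};

// Step 1: reserve virtual address space only -- no physical memory yet.
void kv_reserve(KVBuffer& b, size_t max_bytes) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = b.device;
    cuMemGetAllocationGranularity(&b.page, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    b.reserved = ((max_bytes + b.page - 1) / b.page) * b.page;
    cuMemAddressReserve(&b.base, b.reserved, 0, 0, 0);
}

// Step 2: called as the sequence grows -- back the next slice of the
// virtual range with a freshly allocated physical page.
void kv_grow_one_page(KVBuffer& b) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = b.device;

    CUmemGenericAllocationHandle h;
    cuMemCreate(&h, b.page, &prop, 0);             // allocate a physical page
    cuMemMap(b.base + b.mapped, b.page, 0, h, 0);  // map it at the tail

    CUmemAccessDesc access = {};
    access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access.location.id = b.device;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(b.base + b.mapped, b.page, &access, 1);

    b.handles.push_back(h);
    b.mapped += b.page;
    // Step 3: kernels see `b.base` as ordinary contiguous memory, so
    // unmodified FlashAttention-style kernels can read and write it directly.
}
```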
#### Technical Challenges and Solutions

Historically, using the GPU’s native virtual memory for high-frequency token generation faced two major bottlenecks: **Control Plane Latency** and **Page Granularity**.

**1. Control Plane Latency (The API Bottleneck)**
Standard memory allocation (`cudaMalloc`) is monolithic—it allocates virtual and physical memory simultaneously. The more granular driver API (`cuMemMap` and related calls) separates these steps but involves expensive round-trips to the OS driver. Invoking these APIs synchronously during decoding (which generates one token at a time) would stall the GPU execution pipeline [1].

To solve this, vAttention utilizes **execution overlap**:

- Because LLM decoding is autoregressive and predictable, the system knows exactly when new memory is needed (one token ahead).
- The CPU initiates the memory mapping for the *next* token asynchronously while the GPU is still computing the *current* token. By the time the GPU reaches the next step, the TLB and page tables are already updated, effectively hiding the driver latency [1]. (A sketch of this overlap follows the list.)
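As a hedged illustration of that overlap, building on the hypothetical `KVBuffer` helper above: `launch_decode_step` and `needs_new_page` are stand-ins for the serving engine’s own kernel launch and bookkeeping, not real library calls, and the real system uses its own background-thread scheduler rather than `std::async`.

```cpp
#include <cuda_runtime.h>
#include <future>

void launch_decode_step(int step, cudaStream_t stream);  // stand-in: model's decode kernel(s)
bool needs_new_page(const KVBuffer& kv, int step);       // stand-in: does this step cross a page boundary?

void decode_loop(KVBuffer& kv, cudaStream_t stream, int num_steps) {
    for (int t = 0; t < num_steps; ++t) {
        // 1. Launch this step's GPU work asynchronously on the stream.
        launch_decode_step(t, stream);

        // 2. While the GPU is busy, the CPU extends the KV-cache mapping
        //    needed by step t+1. The VMM calls are synchronous on the CPU
        //    but are hidden behind the GPU compute time.
        auto mapping = std::async(std::launch::async, [&] {
            if (needs_new_page(kv, t + 1)) kv_grow_one_page(kv);
        });

        // 3. Ensure the page tables are updated and this step's token is
        //    available before the next launch.
        mapping.get();
        cudaStreamSynchronize(stream);
    }
}
```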
**2. Page Size Granularity (The Fragmentation Bottleneck)**

The GPU TLB hierarchy is sensitive to page size.

- **4KB Pages:** Too small. Mapping gigabytes of KV cache with 4KB pages causes “TLB thrashing,” degrading performance.
- **2MB Huge Pages:** The standard for large CUDA allocations. However, allocating 2MB for a single-token update causes massive internal fragmentation, negating the benefits of dynamic allocation.

Research identified **64KB** as the optimal page size, offering a balance between TLB efficiency and memory utilization. While standard CUDA APIs default to 2MB, vAttention utilizes modified driver calls to enable 64KB pages, eliminating TLB thrashing without incurring the fragmentation cost of huge pages [1].
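As a rough, hypothetical illustration of the fragmentation side of this trade-off: assume a 32-layer model that keeps separate K and V buffers per layer, giving 64 growable allocations per request, each of which can leave at most one partially filled page at its tail. The worst-case internal fragmentation per request is then

$$2 \times 32 \times 2\,\text{MB} = 128\,\text{MB} \quad \text{with 2MB pages,} \qquad \text{vs.} \quad 2 \times 32 \times 64\,\text{KB} = 4\,\text{MB} \quad \text{with 64KB pages.}$$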
#### Performance and Portability Implications

Moving memory management from software (PagedAttention) to hardware (vAttention) yields measurable benefits:

- **Throughput:** In prefill-heavy workloads, vAttention outperforms PagedAttention-based systems (such as vLLM and FlashInfer) by up to 1.23x due to the elimination of software lookup overheads. In decoding, it matches or exceeds the performance of optimized paged kernels [1].
- **Portability:** A significant advantage is software compatibility. When FlashAttention-3 (optimized for NVIDIA Hopper H100 GPUs) was released, it did not initially support paged KV caches. vAttention enabled its immediate use with dynamic memory support, achieving up to 1.5x higher throughput than PagedAttention-based FlashAttention-2 [1].
#### Conclusion

While PagedAttention solved the critical issue of memory fragmentation in LLM serving, it necessitated a complex software abstraction layer. By leveraging low-level CUDA VMM APIs, handling allocations asynchronously to hide driver latency, and optimizing page sizes, it is possible to achieve dynamic memory management using the GPU’s native hardware. This restores the illusion of contiguous memory, simplifies kernel development, and improves inference performance.

### References
[1] R. Prabhu et al., “vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention,” in *Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’25)*, 2025.