# Why Your Jetson Orin Nano's 40 TOPS Goes Unused (And What That Means for Edge AI)

## Introduction
NVIDIA’s Jetson Orin Nano promises impressive specs: 1024 CUDA cores, 32 Tensor Cores, and 40 TOPS of INT8 compute performance packed into a compact, power-efficient edge device. On paper, it looks like a capable platform for running Large Language Models locally. But there’s a catch—one that reveals a fundamental tension in modern edge AI hardware design.

After running 66 inference tests across seven different language models ranging from 0.5B to 5.4B parameters, I discovered something counterintuitive: the device’s computational muscle sits largely idle during single-stream LLM inference. The bottleneck isn’t computation—it’s memory bandwidth. This isn’t just a quirk of one device; it’s a fundamental characteristic of single-user, autoregressive token generation on edge hardware—a reality that shapes how we should approach local LLM deployment.
## The Hardware: What We’re Working With
The NVIDIA Jetson Orin Nano 8GB I tested features:

- **GPU**: NVIDIA Ampere architecture with 1024 CUDA cores and 32 Tensor Cores
- **Compute Performance**: 40 TOPS (INT8), 10 TFLOPS (FP16), 5 TFLOPS (FP32)
- **Memory**: 8GB LPDDR5 unified memory with 68 GB/s bandwidth
- **Available VRAM**: Approximately 5.2GB after OS overhead
- **CPU**: 6-core ARM Cortex-A78AE (ARMv8.2, 64-bit)
- **TDP**: 7-25W configurable

The unified memory architecture is a double-edged sword: CPU and GPU share the same physical memory pool, which eliminates PCIe transfer overhead but also means you’re working with just 5.2GB of usable VRAM after the OS takes its share. This constraint shapes everything about LLM deployment on this device.
## Testing Methodology
### The Models
I tested seven models ranging from 0.5B to 5.4B parameters—essentially the entire practical deployment range for this hardware. The selection covered two inference backends (Ollama and vLLM) and various quantization strategies:

**Ollama-served models (with quantization):**

- Gemma 3 1B (Q4_K_M, 815MB)
- Gemma 3n E2B (Q4_K_M, 3.5GB, 5.44B total params, 2B effective)
- Qwen 2.5 0.5B (Q4_K_M, 350MB)
- Qwen 3 0.6B (FP8, 600MB)

**vLLM-served models (minimal/no quantization):**

- google/gemma-3-1b-it (FP16, 2GB)
- Qwen/Qwen2.5-0.5B-Instruct (FP16, 1GB)
- Qwen/Qwen3-0.6B-FP8 (FP8, 600MB)
### The Testing Process
Each model faced 10-12 prompts of varying complexity—from simple arithmetic to technical explanations about LLMs themselves. All tests ran with batch size = 1, simulating a single user interacting with a local chatbot—the typical edge deployment scenario. Out of 84 planned tests, 66 completed successfully (78.6% success rate). The failures? Mostly out-of-memory crashes on larger models and occasional inference engine instability.
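For anyone who wants to reproduce this kind of measurement, the sketch below shows one way to capture time-to-first-token and decode throughput against a local Ollama server. It is a minimal illustration rather than the exact harness behind these numbers; it relies on Ollama's streaming `/api/generate` endpoint, whose final message reports `eval_count` and `eval_duration`, and the model tag in the example is arbitrary.

```python
"""Minimal single-stream benchmark sketch (illustrative, not the exact harness used here)."""
import json
import time

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def benchmark(model: str, prompt: str) -> dict:
    """Measure client-side time-to-first-token and server-reported decode speed."""
    start = time.perf_counter()
    first_token_at = None
    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(OLLAMA_URL, json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if first_token_at is None and chunk.get("response"):
                first_token_at = time.perf_counter()
            if chunk.get("done"):
                # eval_count / eval_duration (nanoseconds) describe the decode phase
                decode_tps = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)
                return {
                    "model": model,
                    "ttft_s": round(first_token_at - start, 3),
                    "decode_tokens_per_s": round(decode_tps, 2),
                }
    raise RuntimeError("stream ended without a final 'done' message")


if __name__ == "__main__":
    # Example invocation; the model tag is one of the Ollama models listed above.
    print(benchmark("qwen3:0.6b", "Explain what a KV cache is in two sentences."))
```

The vLLM runs can be timed the same way from the client side through its OpenAI-compatible streaming endpoint.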
### Understanding the Limits: Roofline Analysis
To understand where performance hits its ceiling, I applied roofline analysis—a method that reveals whether a workload is compute-bound (limited by processing power) or memory-bound (limited by data transfer speed). For each model, I calculated:

- **FLOPs per token**: Approximately 2 × total_parameters (accounting for matrix multiplications in forward pass)
- **Bytes per token**: model_size × 1.1 (including 10% overhead for activations and KV cache)
- **Operational Intensity (OI)**: FLOPs per token / Bytes per token
- **Theoretical performance**: min(compute_limit, bandwidth_limit)

The roofline model works by comparing a workload’s operational intensity (how many calculations you do per byte of data moved) against the device’s balance point. If your operational intensity is too low, you’re bottlenecked by memory bandwidth—and as we’ll see, that’s exactly what happens with LLM inference.

![S3 File](/images/benchmarking-llms-on-jetson-orin-nano/16d64bdc9cf14b05b7c40c4718b8091b.png)
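These formulas are simple enough to evaluate directly. The sketch below applies them with the device figures from the hardware section (68 GB/s, 10 TFLOPS FP16) and the per-parameter sizes implied by each quantization (4.5 bits for Q4_K_M, 8 bits for FP8, 16 bits for FP16); treating weight bytes as params × bits/8 is my simplification, but the resulting theoretical throughputs line up with the results table later in the post.

```python
# Roofline sketch for single-stream decoding, following the formulas above:
#   FLOPs/token ≈ 2 × active params, bytes/token ≈ weight bytes × 1.1,
#   theoretical t/s = min(compute limit, bandwidth limit).
BANDWIDTH = 68e9      # bytes/s (LPDDR5, per the spec sheet)
PEAK_FLOPS = 10e12    # FP16 FLOP/s

MODELS = {
    # name: (active parameters, bytes per parameter)
    "gemma3:1b (Q4_K_M)":          (1.0e9, 4.5 / 8),
    "qwen3:0.6b (FP8)":            (0.6e9, 1.0),
    "qwen2.5:0.5b (Q4_K_M)":       (0.5e9, 4.5 / 8),
    "gemma3n:e2b (Q4_K_M)":        (2.0e9, 4.5 / 8),   # 2B effective params per token
    "google/gemma-3-1b-it (FP16)": (1.0e9, 2.0),
}

for name, (params, bytes_per_param) in MODELS.items():
    flops_per_token = 2 * params
    bytes_per_token = params * bytes_per_param * 1.1   # +10% activations / KV cache
    oi = flops_per_token / bytes_per_token              # operational intensity
    tps_compute = PEAK_FLOPS / flops_per_token
    tps_bandwidth = BANDWIDTH / bytes_per_token
    bound = "memory" if tps_bandwidth < tps_compute else "compute"
    print(f"{name:30s} OI={oi:4.2f}  "
          f"theoretical={min(tps_compute, tps_bandwidth):7.2f} t/s  ({bound}-bound)")
```

Running this reproduces the OI and theoretical-throughput columns of the table in the results section.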
## The Results: Speed and Efficiency
### Responsiveness: First Token Latency
The time to generate the first output token—a critical metric for interactive applications—varied significantly:

- qwen3:0.6b (Ollama): 0.522 seconds
- gemma3:1b (Ollama): 1.000 seconds
- qwen2.5:0.5b (Ollama): 1.415 seconds
- gemma3n:e2b (Ollama): 1.998 seconds

Smaller, quantized models get to that first token faster—exactly what you want for a chatbot or interactive assistant where perceived responsiveness matters as much as raw throughput.
### The Memory Bottleneck Revealed
When I compared actual performance against theoretical limits, the results were striking:

| Model | Theoretical (t/s) | Actual (t/s) | Efficiency | Bottleneck | OI (FLOPs/byte) |
|-------|-------------------|--------------|------------|------------|-----------------|
| gemma3:1b | 109.90 | 26.33 | 24.0% | Memory | 3.23 |
| qwen3:0.6b | 103.03 | 38.84 | 37.7% | Memory | 1.82 |
| qwen2.5:0.5b | 219.80 | 35.24 | 16.0% | Memory | 3.23 |
| gemma3n:e2b | 54.95 | 8.98 | 16.3% | Memory | 3.23 |
| google/gemma-3-1b-it | 30.91 | 4.59 | 14.9% | Memory | 0.91 |
| Qwen/Qwen3-0.6B-FP8 | 103.03 | 12.81 | 12.4% | Memory | 1.82 |
| Qwen/Qwen2.5-0.5B-Instruct | 61.82 | 15.18 | 24.6% | Memory | 0.91 |

Every single model is memory-bound in this single-stream inference scenario. Average hardware efficiency sits at just 20.8%—meaning the computational units spend most of their time waiting for data rather than crunching numbers. That advertised 40 TOPS? Largely untapped when generating one token at a time for a single user.
![S3 File](/images/benchmarking-llms-on-jetson-orin-nano/ee04876d75d247f9b27a647462555777.png)
## What This Actually Means
### Why Memory Bandwidth Dominates (in Single-Stream Inference)
The roofline numbers tell a clear story: operational intensity ranges from 0.91 to 3.23 FLOPs/byte across all tested models during single-token generation (batch size = 1). To actually saturate those 1024 CUDA cores and hit compute-bound operation, you’d need an operational intensity around 147 FLOPs/byte at the device’s 68 GB/s memory bandwidth.

In practice, for a model to actually become compute-bound on this device during single-stream inference, it would need an operational intensity exceeding:

```
OI_threshold = Peak_Compute / Memory_Bandwidth
```
Single-stream autoregressive decoding falls 100-600× short of this threshold because each token generation requires loading the entire model from memory (matrix-vector multiplication) while performing only ~2 FLOPs per parameter. The compute units are idle most of the time, simply waiting for model weights and activations to arrive from memory.

Note: Production LLM serving with large batch sizes (32-256 requests) changes this dynamic dramatically—batching transforms matrix-vector operations into matrix-matrix multiplications, increasing operational intensity by 30-250× and making workloads compute-bound. However, edge devices serving single users cannot exploit this optimization.

The largest model tested—gemma3n:e2b at 3.5GB quantized (5.44B total parameters, 2B effective)—shows only 16.3% efficiency, similar to other quantized models. Despite being the largest model, Q4_K_M quantization keeps its memory footprint manageable, resulting in similar operational intensity (3.23 FLOPs/byte) to the other INT4-quantized models. Its MatFormer architecture with selective parameter activation (only 2B of 5.44B params active per token) actually helps reduce memory traffic, though this benefit is partially offset by the overhead of routing logic.
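To make the batching note above concrete, the sketch below scales operational intensity with batch size under a deliberately simplified model: weights are read from DRAM once per decode step while FLOPs grow with the number of concurrent sequences, and per-sequence KV-cache traffic is ignored, so the crossover batch size is only an order-of-magnitude estimate.

```python
# How batching raises operational intensity during decoding (simplified model):
# weights are read from DRAM once per step regardless of batch size, FLOPs grow
# linearly with the batch, and per-sequence KV-cache traffic is ignored.
PEAK_FLOPS = 10e12               # FP16
BANDWIDTH = 68e9                 # bytes/s
RIDGE = PEAK_FLOPS / BANDWIDTH   # ~147 FLOPs/byte needed to become compute-bound

params = 1.0e9
weight_bytes = params * (4.5 / 8) * 1.1   # Q4_K_M weights + 10% overhead

for batch in (1, 8, 32, 128, 256):
    flops_per_step = 2 * params * batch   # matrix-vector turns into matrix-matrix
    oi = flops_per_step / weight_bytes
    label = "compute-bound" if oi >= RIDGE else "memory-bound"
    print(f"batch={batch:3d}  OI={oi:7.1f} FLOPs/byte  {label} (ridge ≈ {RIDGE:.0f})")
```

Using the INT8 peak (40 TOPS) instead of the FP16 figure pushes the ridge to roughly 590 FLOPs/byte, which makes the single-stream gap even larger.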
### What This Means for Edge Deployment
The performance gap between Ollama and vLLM (2.3-5.7×) tells us something important about optimization priorities for single-user edge devices:

- **Qwen 2.5 0.5B:** Ollama (Q4_K_M, 350MB) at 35.24 t/s vs vLLM (FP16, 1GB) at 15.18 t/s—2.32× faster
- **Qwen 3 0.6B:** Ollama (FP8) at 38.84 t/s vs vLLM (FP8) at 12.81 t/s—3.03× faster despite identical quantization
- **Gemma 3 1B:** Ollama (Q4_K_M, 815MB) at 26.33 t/s vs vLLM (FP16, 2GB) at 4.59 t/s—5.74× faster

In single-stream scenarios, quantization delivers near-linear performance gains by directly attacking the memory bandwidth bottleneck. Q4_K_M quantization (4.5 bits/parameter) hits a sweet spot between model quality and speed. Going lower to INT2 might help further, but you’ll need to carefully evaluate output quality.

The real insight: Ollama’s edge-first design philosophy (GGUF format, streamlined execution, optimized kernels from llama.cpp) is fundamentally better aligned with single-stream, memory-constrained edge scenarios. vLLM’s datacenter features—continuous batching, PagedAttention, tensor parallelism—add overhead without providing benefits when serving individual users on unified memory architectures. These features shine in multi-user production serving where batching can be exploited, but hurt performance in the single-stream case.

**What you should actually do**: Stick with Ollama or TensorRT-LLM using Q4_K_M/INT4 quantized models in GGUF format. Target the 0.5-1B parameter range (under 3GB) to leave headroom for KV cache. Focus your optimization efforts on memory access patterns and bandwidth reduction. Watch for emerging techniques like INT4 AWQ, sparse attention, and quantized KV caches.
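As a rough sanity check on that guidance (weights under ~3GB, headroom left for KV cache), here is a back-of-the-envelope fit test against the roughly 5.2GB of usable unified memory. The layer and head dimensions and the runtime-overhead allowance are hypothetical placeholders, not values measured in this post.

```python
# Back-of-the-envelope fit test against ~5.2 GB of usable unified memory.
# The layer/head dimensions and overhead allowance are hypothetical placeholders
# for a ~1B-parameter model, not values measured in this post.
USABLE_BYTES = 5.2e9


def weight_bytes(params: float, bits_per_param: float) -> float:
    return params * bits_per_param / 8


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> float:
    # K and V: 2 values per (layer, KV head, head_dim) per cached token
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


weights = weight_bytes(1.0e9, 4.5)                        # ~1B model at Q4_K_M
kv = kv_cache_bytes(layers=26, kv_heads=4, head_dim=128,  # hypothetical dimensions
                    context_tokens=8192)                   # FP16 KV cache
overhead = 0.5e9                                           # assumed runtime buffers

total = weights + kv + overhead
print(f"weights={weights/1e9:.2f} GB  kv_cache={kv/1e9:.2f} GB  "
      f"total≈{total/1e9:.2f} GB  fits={total <= USABLE_BYTES}")
```

Real GGUF files run somewhat larger than the params × bits estimate (the post lists 815MB for the 1B Q4_K_M model), so keep extra margin.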
### Room for Improvement
The 20.8% average efficiency might sound terrible, but it’s actually typical for edge AI devices running single-stream inference. Datacenter GPUs hit 60-80% efficiency on optimized workloads—but that’s typically with large batch sizes that increase operational intensity. In comparable single-stream scenarios, even high-end GPUs see similar efficiency drops. Edge devices commonly land in the 15-40% range due to architectural tradeoffs and memory bandwidth constraints relative to their compute capability.

Three factors explain the gap:

1. **Architecture**: Unified memory sacrifices bandwidth for integration simplicity. The 4MB L2 cache and 7-15W TDP limit further constrain performance.
2. **Software maturity**: Edge inference frameworks lag behind their datacenter counterparts in optimization.
3. **Runtime overhead**: Quantization/dequantization operations, Python abstractions, and non-optimized kernels all add up.

The consistent 16-24% efficiency across most models suggests there’s room for 2-3× speedups through better software optimization—particularly in memory access patterns and kernel implementations. But fundamental performance leaps will require hardware changes—specifically, prioritizing memory bandwidth (200+ GB/s) over raw compute capability in future edge AI chips.
## Where to Go From Here
### Software Optimizations Worth Pursuing
- Optimize memory access patterns in attention and MLP kernels
- Implement quantized KV cache (8-bit or lower)
- Tune for small batch sizes (2-4) to improve memory bus utilization
- Overlap CPU-GPU pipeline operations to hide latency
### Research Directions
- Architectures with higher operational intensity (fewer memory accesses per compute operation)
- Sparse attention patterns to reduce memory movement
- On-device LoRA fine-tuning with frozen, quantized base weights
- Multi-model serving with shared base model weights
### What Edge AI Hardware Designers Should Focus On
Future edge AI devices optimized for local, single-user LLM inference need a fundamental shift in priorities: memory bandwidth over raw compute capability. Specifically:

- 200+ GB/s memory bandwidth (3× current Jetson Orin Nano)
- HBM integration for higher bandwidth density
- 16GB+ capacity to support 7B+ parameter models
- Purpose-built INT4/INT8 accelerators with larger on-chip caches to reduce DRAM traffic