eric
2026-02-04 06:20:15 +00:00
parent bd862cb238
commit 7de3b87680
38 changed files with 173 additions and 104 deletions


@@ -39,7 +39,7 @@ After running 66 inference tests across seven different language models ranging
<a class=heading-link href=#what-this-actually-means><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><h3 id=why-memory-bandwidth-dominates-in-single-stream-inference>Why Memory Bandwidth Dominates (in Single-Stream Inference)
<a class=heading-link href=#why-memory-bandwidth-dominates-in-single-stream-inference><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><p>The roofline numbers tell a clear story: operational intensity ranges from 0.91 to 3.23 FLOPs/byte across all tested models during single-token generation (batch size = 1). To actually saturate those 1024 CUDA cores and hit compute-bound operation, you&rsquo;d need an operational intensity around 147 FLOPs/byte at the device&rsquo;s 68 GB/s memory bandwidth.</p><p>In practice, for a model to actually become compute-bound on this device during single-stream inference, it would need an operational intensity exceeding:</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-fallback data-lang=fallback><span style=display:flex><span>OI_threshold = Peak_Compute / Memory_Bandwidth
<span class=sr-only>Link to heading</span></a></h3><p>The roofline numbers tell a clear story: operational intensity ranges from 0.91 to 3.23 FLOPs/byte across all tested models during single-token generation (batch size = 1). To actually saturate those 1024 CUDA cores and hit compute-bound operation, you&rsquo;d need an operational intensity around 147 FLOPs/byte at the device&rsquo;s 68 GB/s memory bandwidth.</p><p>In practice, for a model to actually become compute-bound on this device during single-stream inference, it would need an operational intensity exceeding:</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none><code class=language-fallback data-lang=fallback><span style=display:flex><span>OI_threshold = Peak_Compute / Memory_Bandwidth
</span></span><span style=display:flex><span> = (40 × 10^12 ops/s) / (68 × 10^9 bytes/s)
</span></span><span style=display:flex><span> = 588 FLOPs/byte
</span></span></code></pre></div><p>Single-stream autoregressive decoding falls 100-600× short of this threshold because each token generation requires loading the entire model from memory (matrix-vector multiplication) while performing only ~2 FLOPs per parameter. The compute units are idle most of the time, simply waiting for model weights and activations to arrive from memory.</p><p>Note: Production LLM serving with large batch sizes (32-256 requests) changes this dynamic dramatically—batching transforms matrix-vector operations into matrix-matrix multiplications, increasing operational intensity by 30-250× and making workloads compute-bound. However, edge devices serving single users cannot exploit this optimization.</p><p>The largest model tested—gemma3n:e2b at 3.5GB quantized (5.44B total parameters, 2B effective)—shows only 16.3% efficiency, similar to other quantized models. Despite being the largest model, Q4_K_M quantization keeps its memory footprint manageable, resulting in similar operational intensity (3.23 FLOPs/byte) to the other INT4-quantized models. Its MatFormer architecture with selective parameter activation (only 2B of 5.44B params active per token) actually helps reduce memory traffic, though this benefit is partially offset by the overhead of routing logic.</p><h3 id=what-this-means-for-edge-deployment>What This Means for Edge Deployment
@@ -62,4 +62,4 @@ After running 66 inference tests across seven different language models ranging
2016 -
2026
Eric X. Liu
<a href="https://git.ericxliu.me/eric/ericxliu-me/commit/6100dca">[6100dca]</a></section></footer></main><script src=/js/coder.min.6ae284be93d2d19dad1f02b0039508d9aab3180a12a06dcc71b0b0ef7825a317.js integrity="sha256-auKEvpPS0Z2tHwKwA5UI2aqzGAoSoG3McbCw73gloxc="></script><script defer src=https://static.cloudflareinsights.com/beacon.min.js data-cf-beacon='{"token": "987638e636ce4dbb932d038af74c17d1"}'></script></body></html>
<a href="https://git.ericxliu.me/eric/ericxliu-me/commit/45629c5">[45629c5]</a></section></footer></main><script src=/js/coder.min.6ae284be93d2d19dad1f02b0039508d9aab3180a12a06dcc71b0b0ef7825a317.js integrity="sha256-auKEvpPS0Z2tHwKwA5UI2aqzGAoSoG3McbCw73gloxc="></script><script defer src=https://static.cloudflareinsights.com/beacon.min.js data-cf-beacon='{"token": "987638e636ce4dbb932d038af74c17d1"}'></script></body></html>