eric
2025-12-20 17:22:08 +00:00
parent 6e752d8af2
commit 2d7d143cbf
24 changed files with 48 additions and 48 deletions


@@ -10,7 +10,7 @@ After running 66 inference tests across seven different language models ranging
After running 66 inference tests across seven different language models ranging from 0.5B to 5.4B parameters, I discovered something counterintuitive: the device’s computational muscle sits largely idle during single-stream LLM inference. The bottleneck isn’t computation—it’s memory bandwidth. This isn’t just a quirk of one device; it’s a fundamental characteristic of single-user, autoregressive token generation on edge hardware—a reality that shapes how we should approach local LLM deployment."><meta property="og:url" content="https://ericxliu.me/posts/benchmarking-llms-on-jetson-orin-nano/"><meta property="og:site_name" content="Eric X. Liu's Personal Page"><meta property="og:title" content="Why Your Jetson Orin Nano's 40 TOPS Goes Unused (And What That Means for Edge AI)"><meta property="og:description" content="Introduction Link to heading NVIDIA’s Jetson Orin Nano promises impressive specs: 1024 CUDA cores, 32 Tensor Cores, and 40 TOPS of INT8 compute performance packed into a compact, power-efficient edge device. On paper, it looks like a capable platform for running Large Language Models locally. But there’s a catch—one that reveals a fundamental tension in modern edge AI hardware design.
After running 66 inference tests across seven different language models ranging from 0.5B to 5.4B parameters, I discovered something counterintuitive: the device’s computational muscle sits largely idle during single-stream LLM inference. The bottleneck isn’t computation—it’s memory bandwidth. This isn’t just a quirk of one device; it’s a fundamental characteristic of single-user, autoregressive token generation on edge hardware—a reality that shapes how we should approach local LLM deployment."><meta property="og:locale" content="en"><meta property="og:type" content="article"><meta property="article:section" content="posts"><meta property="article:published_time" content="2025-10-04T00:00:00+00:00"><meta property="article:modified_time" content="2025-10-04T20:41:50+00:00"><link rel=preload href=/fonts/fa-solid-900.woff2 as=font type=font/woff2 crossorigin><link rel=preload href=/fonts/fa-brands-400.woff2 as=font type=font/woff2 crossorigin><link rel=canonical href=https://ericxliu.me/posts/benchmarking-llms-on-jetson-orin-nano/><link rel=preload href=/fonts/fa-brands-400.woff2 as=font type=font/woff2 crossorigin><link rel=preload href=/fonts/fa-regular-400.woff2 as=font type=font/woff2 crossorigin><link rel=preload href=/fonts/fa-solid-900.woff2 as=font type=font/woff2 crossorigin><link rel=stylesheet href=/css/coder.min.f03d6359cf766772af14fbe07ce6aca734b321c2e15acba0bbf4e2254941c460.css integrity="sha256-8D1jWc92Z3KvFPvgfOaspzSzIcLhWsugu/TiJUlBxGA=" crossorigin=anonymous media=screen><link rel=stylesheet href=/css/coder-dark.min.a00e6364bacbc8266ad1cc81230774a1397198f8cfb7bcba29b7d6fcb54ce57f.css integrity="sha256-oA5jZLrLyCZq0cyBIwd0oTlxmPjPt7y6KbfW/LVM5X8=" crossorigin=anonymous media=screen><link rel=icon type=image/svg+xml href=/images/favicon.svg sizes=any><link rel=icon type=image/png href=/images/favicon-32x32.png sizes=32x32><link rel=icon type=image/png href=/images/favicon-16x16.png sizes=16x16><link rel=apple-touch-icon href=/images/apple-touch-icon.png><link rel=apple-touch-icon sizes=180x180 href=/images/apple-touch-icon.png><link rel=manifest href=/site.webmanifest><link rel=mask-icon href=/images/safari-pinned-tab.svg color=#5bbad5><script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-3972604619956476" crossorigin=anonymous></script><script type=application/ld+json>{"@context":"http://schema.org","@type":"Person","name":"Eric X. 
Liu","url":"https:\/\/ericxliu.me\/","description":"Software \u0026 Performance Engineer at Google","sameAs":["https:\/\/www.linkedin.com\/in\/eric-x-liu-46648b93\/","https:\/\/git.ericxliu.me\/eric"]}</script><script type=application/ld+json>{"@context":"http://schema.org","@type":"BlogPosting","headline":"Why Your Jetson Orin Nano\u0027s 40 TOPS Goes Unused (And What That Means for Edge AI)","genre":"Blog","wordcount":"1866","url":"https:\/\/ericxliu.me\/posts\/benchmarking-llms-on-jetson-orin-nano\/","datePublished":"2025-10-04T00:00:00\u002b00:00","dateModified":"2025-10-04T20:41:50\u002b00:00","description":"\u003ch2 id=\u0022introduction\u0022\u003e\n Introduction\n \u003ca class=\u0022heading-link\u0022 href=\u0022#introduction\u0022\u003e\n \u003ci class=\u0022fa-solid fa-link\u0022 aria-hidden=\u0022true\u0022 title=\u0022Link to heading\u0022\u003e\u003c\/i\u003e\n \u003cspan class=\u0022sr-only\u0022\u003eLink to heading\u003c\/span\u003e\n \u003c\/a\u003e\n\u003c\/h2\u003e\n\u003cp\u003eNVIDIA\u0026rsquo;s Jetson Orin Nano promises impressive specs: 1024 CUDA cores, 32 Tensor Cores, and 40 TOPS of INT8 compute performance packed into a compact, power-efficient edge device. On paper, it looks like a capable platform for running Large Language Models locally. But there\u0026rsquo;s a catch—one that reveals a fundamental tension in modern edge AI hardware design.\u003c\/p\u003e\n\u003cp\u003eAfter running 66 inference tests across seven different language models ranging from 0.5B to 5.4B parameters, I discovered something counterintuitive: the device\u0026rsquo;s computational muscle sits largely idle during single-stream LLM inference. The bottleneck isn\u0026rsquo;t computation—it\u0026rsquo;s memory bandwidth. This isn\u0026rsquo;t just a quirk of one device; it\u0026rsquo;s a fundamental characteristic of single-user, autoregressive token generation on edge hardware—a reality that shapes how we should approach local LLM deployment.\u003c\/p\u003e","author":{"@type":"Person","name":"Eric X. Liu"}}</script></head><body class="preload-transitions colorscheme-auto"><div class=float-container><a id=dark-mode-toggle class=colorscheme-toggle><i class="fa-solid fa-adjust fa-fw" aria-hidden=true></i></a></div><main class=wrapper><nav class=navigation><section class=container><a class=navigation-title href=https://ericxliu.me/>Eric X. Liu's Personal Page
</a><input type=checkbox id=menu-toggle>
<label class="menu-button float-right" for=menu-toggle><i class="fa-solid fa-bars fa-fw" aria-hidden=true></i></label><ul class=navigation-list><li class=navigation-item><a class=navigation-link href=/posts/>Posts</a></li><li class=navigation-item><a class=navigation-link href=https://chat.ericxliu.me>Chat</a></li><li class=navigation-item><a class=navigation-link href=https://git.ericxliu.me/user/oauth2/Authenitk>Git</a></li><li class=navigation-item><a class=navigation-link href=https://coder.ericxliu.me/api/v2/users/oidc/callback>Coder</a></li><li class=navigation-item><a class=navigation-link href=/>|</a></li><li class=navigation-item><a class=navigation-link href=https://sso.ericxliu.me>Sign in</a></li></ul></section></nav><div class=content><section class="container post"><article><header><div class=post-title><h1 class=title><a class=title-link href=https://ericxliu.me/posts/benchmarking-llms-on-jetson-orin-nano/>Why Your Jetson Orin Nano's 40 TOPS Goes Unused (And What That Means for Edge AI)</a></h1></div><div class=post-meta><div class=date><span class=posted-on><i class="fa-solid fa-calendar" aria-hidden=true></i>
<label class="menu-button float-right" for=menu-toggle><i class="fa-solid fa-bars fa-fw" aria-hidden=true></i></label><ul class=navigation-list><li class=navigation-item><a class=navigation-link href=/posts/>Posts</a></li><li class=navigation-item><a class=navigation-link href=https://chat.ericxliu.me>Chat</a></li><li class=navigation-item><a class=navigation-link href=https://git.ericxliu.me/user/oauth2/Authenitk>Git</a></li><li class=navigation-item><a class=navigation-link href=https://coder.ericxliu.me/api/v2/users/oidc/callback>Coder</a></li><li class=navigation-item><a class=navigation-link href=/about/>About</a></li><li class=navigation-item><a class=navigation-link href=/>|</a></li><li class=navigation-item><a class=navigation-link href=https://sso.ericxliu.me>Sign in</a></li></ul></section></nav><div class=content><section class="container post"><article><header><div class=post-title><h1 class=title><a class=title-link href=https://ericxliu.me/posts/benchmarking-llms-on-jetson-orin-nano/>Why Your Jetson Orin Nano's 40 TOPS Goes Unused (And What That Means for Edge AI)</a></h1></div><div class=post-meta><div class=date><span class=posted-on><i class="fa-solid fa-calendar" aria-hidden=true></i>
<time datetime=2025-10-04T00:00:00Z>October 4, 2025
</time></span><span class=reading-time><i class="fa-solid fa-clock" aria-hidden=true></i>
9-minute read</span></div></div></header><div class=post-content><h2 id=introduction>Introduction
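
The claim in the diffed description, that single-stream decode is memory-bandwidth-bound rather than compute-bound, follows from each generated token having to stream the full set of model weights out of DRAM. Below is a minimal back-of-envelope sketch of the resulting throughput ceiling; the ~68 GB/s LPDDR5 bandwidth figure and the one-full-weight-pass-per-token assumption are spec-sheet approximations, not numbers taken from this commit.

```python
# Rough roofline-style ceiling on single-stream decode speed when memory-bound.
# Assumptions (not from the post): ~68 GB/s usable LPDDR5 bandwidth on the
# Orin Nano, and every generated token reads all model weights from DRAM once.

def decode_ceiling_tok_s(params_billion: float, bytes_per_param: float,
                         bandwidth_gb_s: float = 68.0) -> float:
    """Upper bound on tokens/second when weight traffic saturates memory."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes

# Hypothetical examples spanning the benchmarked 0.5B-5.4B range:
print(f"5.4B @ 4-bit: {decode_ceiling_tok_s(5.4, 0.5):.0f} tok/s ceiling")  # ~25
print(f"0.5B @ FP16:  {decode_ceiling_tok_s(0.5, 2.0):.0f} tok/s ceiling")  # ~68
```

Because these ceilings sit far below what 40 TOPS could sustain arithmetically, the tensor cores stay mostly idle unless batching or speculative decoding raises the amount of work done per byte moved.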
@@ -62,4 +62,4 @@ After running 66 inference tests across seven different language models ranging
2016 -
2025
Eric X. Liu
<a href="https://git.ericxliu.me/eric/ericxliu-me/commit/3f9f80d">[3f9f80d]</a></section></footer></main><script src=/js/coder.min.6ae284be93d2d19dad1f02b0039508d9aab3180a12a06dcc71b0b0ef7825a317.js integrity="sha256-auKEvpPS0Z2tHwKwA5UI2aqzGAoSoG3McbCw73gloxc="></script><script defer src=https://static.cloudflareinsights.com/beacon.min.js data-cf-beacon='{"token": "987638e636ce4dbb932d038af74c17d1"}'></script></body></html>
<a href="https://git.ericxliu.me/eric/ericxliu-me/commit/2bb856b">[2bb856b]</a></section></footer></main><script src=/js/coder.min.6ae284be93d2d19dad1f02b0039508d9aab3180a12a06dcc71b0b0ef7825a317.js integrity="sha256-auKEvpPS0Z2tHwKwA5UI2aqzGAoSoG3McbCw73gloxc="></script><script defer src=https://static.cloudflareinsights.com/beacon.min.js data-cf-beacon='{"token": "987638e636ce4dbb932d038af74c17d1"}'></script></body></html>