From 85e0d053b747620b62ba43af42d13c233e0b1031 Mon Sep 17 00:00:00 2001
From: Automated Publisher
Date: Sat, 4 Oct 2025 17:44:11 +0000
Subject: [PATCH] =?UTF-8?q?=F0=9F=93=9A=20Auto-publish:=20Add/update=201?=
 =?UTF-8?q?=20blog=20posts?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Generated on: Sat Oct 4 17:44:11 UTC 2025
Source: md-personal repository
---
 .../posts/benchmarking-llms-on-jetson-orin-nano.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/content/posts/benchmarking-llms-on-jetson-orin-nano.md b/content/posts/benchmarking-llms-on-jetson-orin-nano.md
index d5ece34..8e1a0b1 100644
--- a/content/posts/benchmarking-llms-on-jetson-orin-nano.md
+++ b/content/posts/benchmarking-llms-on-jetson-orin-nano.md
@@ -31,7 +31,7 @@ I tested seven models ranging from 0.5B to 5.4B parameters—essentially the ent

 **Ollama-served models (with quantization):**
 - Gemma 3 1B (Q4_K_M, 815MB)
-- Gemma 3n E2B (bfloat16, 11GB, 5.44B total params, 2B effective)
+- Gemma 3n E2B (Q4_K_M, 3.5GB, 5.44B total params, 2B effective)
 - Qwen 2.5 0.5B (Q4_K_M, 350MB)
 - Qwen 3 0.6B (FP8, 600MB)

@@ -97,12 +97,12 @@ When I compared actual performance against theoretical limits, the results were
 | gemma3:1b | 109.90 | 26.33 | 24.0% | Memory | 3.23 |
 | qwen3:0.6b | 103.03 | 38.84 | 37.7% | Memory | 1.82 |
 | qwen2.5:0.5b | 219.80 | 35.24 | 16.0% | Memory | 3.23 |
-| gemma3n:e2b | 15.45 | 8.98 | 58.1% | Memory | 0.91 |
+| gemma3n:e2b | 54.95 | 8.98 | 16.3% | Memory | 3.23 |
 | google/gemma-3-1b-it | 30.91 | 4.59 | 14.9% | Memory | 0.91 |
 | Qwen/Qwen3-0.6B-FP8 | 103.03 | 12.81 | 12.4% | Memory | 1.82 |
 | Qwen/Qwen2.5-0.5B-Instruct | 61.82 | 15.18 | 24.6% | Memory | 0.91 |

-Every single model is memory-bound. Average hardware efficiency sits at just 26.8%—meaning the computational units spend most of their time waiting for data rather than crunching numbers. That advertised 40 TOPS? Largely untapped.
+Every single model is memory-bound. Average hardware efficiency sits at just 20.8%—meaning the computational units spend most of their time waiting for data rather than crunching numbers. That advertised 40 TOPS? Largely untapped.

 ![S3 File](/images/benchmarking-llms-on-jetson-orin-nano/ee04876d75d247f9b27a647462555777.png)

@@ -122,7 +122,7 @@ OI_threshold = Peak_Compute / Memory_Bandwidth

 Current LLM architectures fall 100-600× short of this threshold during autoregressive decoding. The compute units are idle most of the time, simply waiting for model weights and activations to arrive from memory.

-Interestingly, the largest model tested—gemma3n:e2b at 11GB and 5.44B parameters—achieved the highest efficiency at 58.1%. This makes sense: its massive 4.4 GB/token memory requirement means it's saturating the memory bandwidth, so actual performance approaches the theoretical ceiling. The model's Mixture-of-Experts architecture helps too, since it only activates a subset of parameters per token, reducing memory movement while maintaining model capacity.
+The largest model tested—gemma3n:e2b at 3.5GB quantized (5.44B total parameters, 2B effective)—shows only 16.3% efficiency, similar to the other quantized models. Q4_K_M quantization keeps its memory footprint manageable despite its size, resulting in an operational intensity (3.23 FLOPs/byte) similar to the other INT4-quantized models. Its MatFormer architecture with selective parameter activation (only 2B of 5.44B params active per token) actually helps reduce memory traffic, though this benefit is partially offset by the overhead of routing logic.

 ### What This Means for Deployment

@@ -140,7 +140,7 @@ The real insight: Ollama's edge-first design philosophy (GGUF format, streamline

 ### Room for Improvement

-The 26.8% average efficiency might sound terrible, but it's actually typical for edge AI devices. Datacenter GPUs hit 60-80% on optimized workloads, while edge devices commonly land in the 30-50% range due to architectural tradeoffs.
+The 20.8% average efficiency might sound terrible, but it's actually typical for edge AI devices. Datacenter GPUs hit 60-80% on optimized workloads, while edge devices commonly land in the 15-40% range due to architectural tradeoffs and memory bandwidth constraints.

 Three factors explain the gap:

@@ -148,7 +148,7 @@ Three factors explain the gap:
 2. **Software maturity**: Edge inference frameworks lag behind their datacenter counterparts in optimization.
 3. **Runtime overhead**: Quantization/dequantization operations, Python abstractions, and non-optimized kernels all add up.

-The gemma3n:e2b model proving that 58.1% is achievable suggests smaller models could see 2-3× speedups through better software. But fundamental performance leaps will require hardware changes—specifically, prioritizing memory bandwidth (200+ GB/s) over raw compute capability in future edge AI chips.
+The consistent 16-24% efficiency across most models suggests there's room for 2-3× speedups through better software optimization—particularly in memory access patterns and kernel implementations. But fundamental performance leaps will require hardware changes—specifically, prioritizing memory bandwidth (200+ GB/s) over raw compute capability in future edge AI chips.

 ## Where to Go From Here
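
As a sanity check on the figures this patch introduces, here is a minimal Python sketch (not part of the patch itself; the theoretical and measured tokens/s values are copied from the table rows in the diff above) that re-derives the 16.3% gemma3n:e2b efficiency and the 20.8% average efficiency.

```python
# Values copied from the roofline table in the diff above:
# (theoretical tok/s at the memory-bandwidth ceiling, measured tok/s).
models = {
    "gemma3:1b":                  (109.90, 26.33),
    "qwen3:0.6b":                 (103.03, 38.84),
    "qwen2.5:0.5b":               (219.80, 35.24),
    "gemma3n:e2b":                (54.95,   8.98),  # updated row
    "google/gemma-3-1b-it":       (30.91,   4.59),
    "Qwen/Qwen3-0.6B-FP8":        (103.03, 12.81),
    "Qwen/Qwen2.5-0.5B-Instruct": (61.82,  15.18),
}

# Hardware efficiency = measured throughput / theoretical ceiling.
efficiency = {name: measured / ceiling for name, (ceiling, measured) in models.items()}

print(f"gemma3n:e2b: {efficiency['gemma3n:e2b']:.1%}")                   # -> 16.3%
print(f"average:     {sum(efficiency.values()) / len(efficiency):.1%}")  # -> 20.8%
```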