📚 Auto-publish: Add/update 4 blog posts
All checks were successful
Hugo Publish CI / build-and-deploy (push) Successful in 10s

Generated on: Sat Oct  4 05:52:46 UTC 2025
Source: md-personal repository
This commit is contained in:
Automated Publisher
2025-10-04 05:52:46 +00:00
parent 7ef6ce1987
commit 0e4b4194b6
4 changed files with 208 additions and 0 deletions


@@ -0,0 +1,2 @@
image-b25565d6f47e1ba4ce2deca7e161726b86df356e.png|388f43c3f800483aae5ea487e8f45922.png|387cde4274484063c4c7e1f9f37c185a
image-7913a54157c2f4b8d0b7f961640a9c359b2d2a4f.png|ee04876d75d247f9b27a647462555777.png|2371421b04f856f7910dc8b46a7a6fb9


@@ -0,0 +1,206 @@
---
title: "Why Your Jetson Orin Nano's 40 TOPS Goes Unused (And What That Means for Edge AI)"
date: 2025-10-04
draft: false
---
## Introduction
NVIDIA's Jetson Orin Nano promises impressive specs: 1024 CUDA cores, 32 Tensor Cores, and 40 TOPS of INT8 compute performance packed into a compact, power-efficient edge device. On paper, it looks like a capable platform for running Large Language Models locally. But there's a catch—one that reveals a fundamental tension in modern edge AI hardware design.
After running 66 inference tests across seven different language models ranging from 0.5B to 5.4B parameters, I discovered something counterintuitive: the device's computational muscle sits largely idle during LLM inference. The bottleneck isn't computation—it's memory bandwidth. This isn't just a quirk of one device; it's a reality that affects how we should think about deploying LLMs at the edge.
## The Hardware: What We're Working With
The NVIDIA Jetson Orin Nano 8GB I tested features:
- **GPU**: NVIDIA Ampere architecture with 1024 CUDA cores and 32 Tensor Cores
- **Compute Performance**: 40 TOPS (INT8), 10 TFLOPS (FP16), 5 TFLOPS (FP32)
- **Memory**: 8GB LPDDR5 unified memory with 68 GB/s bandwidth
- **Available VRAM**: Approximately 5.2GB after OS overhead
- **CPU**: 6-core ARM Cortex-A78AE (ARMv8.2, 64-bit)
- **TDP**: 7-25W configurable
The unified memory architecture is a double-edged sword: CPU and GPU share the same physical memory pool, which eliminates PCIe transfer overhead but also means you're working with just 5.2GB of usable VRAM after the OS takes its share. This constraint shapes everything about LLM deployment on this device.
## Testing Methodology
### The Models
I tested seven models ranging from 0.5B to 5.4B parameters—essentially the entire practical deployment range for this hardware. The selection covered two inference backends (Ollama and vLLM) and various quantization strategies:
**Ollama-served models (with quantization):**
- Gemma 3 1B (Q4_K_M, 815MB)
- Gemma 3n E2B (bfloat16, 11GB, 5.44B total params, 2B effective)
- Qwen 2.5 0.5B (Q4_K_M, 350MB)
- Qwen 3 0.6B (FP8, 600MB)
**vLLM-served models (minimal/no quantization):**
- google/gemma-3-1b-it (FP16, 2GB)
- Qwen/Qwen2.5-0.5B-Instruct (FP16, 1GB)
- Qwen/Qwen3-0.6B-FP8 (FP8, 600MB)
### The Testing Process
Each model faced 10-12 prompts of varying complexity—from simple arithmetic to technical explanations about LLMs themselves. Out of 84 planned tests, 66 completed successfully (78.6% success rate). The failures? Mostly out-of-memory crashes on larger models and occasional inference engine instability.
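The exact benchmark harness isn't included in this post, but a minimal sketch of the kind of probe involved might look like the following. It assumes a local Ollama server at its default port and relies on the `eval_count` and `eval_duration` fields that Ollama's `/api/generate` endpoint returns; the model name and prompt are placeholders.

```python
import requests  # assumes a local Ollama server at the default port (11434)

def decode_speed(model: str, prompt: str) -> float:
    """Send one prompt to Ollama and return decode throughput in tokens/second."""
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    data = r.json()
    # eval_count = tokens generated, eval_duration = decode time in nanoseconds
    return data["eval_count"] / data["eval_duration"] * 1e9

if __name__ == "__main__":
    # Placeholder model and prompt; any entry from the tables below would work.
    print(f"{decode_speed('qwen2.5:0.5b', 'Explain KV caching in two sentences.'):.1f} t/s")
```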
### Understanding the Limits: Roofline Analysis
To understand where performance hits its ceiling, I applied roofline analysis—a method that reveals whether a workload is compute-bound (limited by processing power) or memory-bound (limited by data transfer speed). For each model, I calculated:
- **FLOPs per token**: Approximately 2 × total_parameters (accounting for the matrix multiplications in the forward pass)
- **Bytes per token**: model_size × 1.1 (including 10% overhead for activations and KV cache)
- **Operational Intensity (OI)**: FLOPs per token / Bytes per token
- **Theoretical performance**: min(compute_limit, bandwidth_limit)
The roofline model works by comparing a workload's operational intensity (how many calculations you do per byte of data moved) against the device's balance point. If your operational intensity is too low, you're bottlenecked by memory bandwidth—and as we'll see, that's exactly what happens with LLM inference.
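To make these formulas concrete, here is a small sketch of the estimate. The hardware constants are the ones listed above; the parameter count and bytes-per-parameter in the example are illustrative.

```python
# Roofline estimate for the Jetson Orin Nano 8GB, using the formulas above
PEAK_INT8_OPS = 40e12   # 40 TOPS
MEM_BW = 68e9           # 68 GB/s LPDDR5

def roofline(params: float, bytes_per_param: float, overhead: float = 1.1):
    """Return (operational intensity, theoretical tokens/s) for one decode step."""
    flops_per_token = 2 * params                            # ~2 FLOPs per parameter
    bytes_per_token = params * bytes_per_param * overhead   # weights + ~10% activations/KV cache
    oi = flops_per_token / bytes_per_token
    compute_limit = PEAK_INT8_OPS / flops_per_token         # ceiling if compute-bound
    bandwidth_limit = MEM_BW / bytes_per_token               # ceiling if memory-bound
    return oi, min(compute_limit, bandwidth_limit)

# Example: a 0.6B-parameter model served in FP8 (1 byte per parameter)
oi, tps = roofline(0.6e9, 1.0)
print(f"OI = {oi:.2f} FLOPs/byte, theoretical ceiling = {tps:.2f} t/s")
```

For a 0.6B-parameter model in FP8 this reproduces the roughly 103 t/s ceiling that appears in the efficiency table further down.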
![S3 File](/images/benchmarking-llms-on-jetson-orin-nano/388f43c3f800483aae5ea487e8f45922.png)
## The Results: Speed and Efficiency
### What Actually Runs Fast
Here's how the models ranked by token generation speed:
| Rank | Model | Backend | Avg Speed (t/s) | Std Dev | Success Rate |
|------|-------|---------|-----------------|---------|--------------|
| 1 | qwen3:0.6b | Ollama | 38.84 | 1.42 | 100% |
| 2 | qwen2.5:0.5b | Ollama | 35.24 | 2.72 | 100% |
| 3 | gemma3:1b | Ollama | 26.33 | 2.56 | 100% |
| 4 | Qwen/Qwen2.5-0.5B-Instruct | vLLM | 15.18 | 2.15 | 100% |
| 5 | Qwen/Qwen3-0.6B-FP8 | vLLM | 12.81 | 0.36 | 100% |
| 6 | gemma3n:e2b | Ollama | 8.98 | 1.22 | 100% |
| 7 | google/gemma-3-1b-it | vLLM | 4.59 | 1.52 | 100% |
The standout finding: quantized sub-1B models hit 25-40 tokens/second, with Ollama consistently outperforming vLLM by 2-6× thanks to aggressive quantization and edge-optimized execution. These numbers align well with independent benchmarks from NVIDIA's Jetson AI Lab (Llama 3.2 3B at 27.7 t/s, SmolLM2 at 41 t/s), confirming this is typical performance for the hardware class.
![S3 File](/images/benchmarking-llms-on-jetson-orin-nano/ee04876d75d247f9b27a647462555777.png)
### Responsiveness: First Token Latency
The time to generate the first output token—a critical metric for interactive applications—varied significantly:
- qwen3:0.6b (Ollama): 0.522 seconds
- gemma3:1b (Ollama): 1.000 seconds
- qwen2.5:0.5b (Ollama): 1.415 seconds
- gemma3n:e2b (Ollama): 1.998 seconds
Smaller, quantized models generally reach that first token sooner, which is exactly what you want for a chatbot or interactive assistant where perceived responsiveness matters as much as raw throughput.
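A streaming variant of the same kind of probe can approximate this metric. Again, this is a sketch assuming a local Ollama server, not the measurement code behind the numbers above; Ollama streams newline-delimited JSON chunks, and the timer stops at the first chunk containing generated text.

```python
import json, time, requests  # assumes a local Ollama server at the default port

def time_to_first_token(model: str, prompt: str) -> float:
    """Stream a generation and return seconds until the first output chunk arrives."""
    start = time.perf_counter()
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=300,
    ) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            # First non-empty "response" field marks the first generated token
            if line and json.loads(line).get("response"):
                return time.perf_counter() - start
    return float("nan")

print(f"TTFT: {time_to_first_token('qwen3:0.6b', 'What is 17 * 23?'):.3f} s")
```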
### The Memory Bottleneck Revealed
When I compared actual performance against theoretical limits, the results were striking:
| Model | Theoretical (t/s) | Actual (t/s) | Efficiency | Bottleneck | OI (FLOPs/byte) |
|-------|-------------------|--------------|------------|------------|-----------------|
| gemma3:1b | 109.90 | 26.33 | 24.0% | Memory | 3.23 |
| qwen3:0.6b | 103.03 | 38.84 | 37.7% | Memory | 1.82 |
| qwen2.5:0.5b | 219.80 | 35.24 | 16.0% | Memory | 3.23 |
| gemma3n:e2b | 15.45 | 8.98 | 58.1% | Memory | 0.91 |
| google/gemma-3-1b-it | 30.91 | 4.59 | 14.9% | Memory | 0.91 |
| Qwen/Qwen3-0.6B-FP8 | 103.03 | 12.81 | 12.4% | Memory | 1.82 |
| Qwen/Qwen2.5-0.5B-Instruct | 61.82 | 15.18 | 24.6% | Memory | 0.91 |
Every single model is memory-bound. Average hardware efficiency sits at just 26.8%—meaning the computational units spend most of their time waiting for data rather than crunching numbers. That advertised 40 TOPS? Largely untapped.
## What This Actually Means
### Why Memory Bandwidth Dominates
The roofline numbers tell a clear story: operational intensity ranges from 0.91 to 3.23 FLOPs/byte across all tested models. To saturate those 1024 CUDA cores at the device's 10 TFLOPS FP16 rating, a workload would need an operational intensity of roughly 147 FLOPs/byte given the 68 GB/s memory bandwidth. To reach the advertised 40 TOPS INT8 figure, the bar is even higher:
```
OI_threshold = Peak_Compute / Memory_Bandwidth
= (40 × 10^12 ops/s) / (68 × 10^9 bytes/s)
= 588 FLOPs/byte
```
Current LLM architectures fall short of this threshold by a factor of roughly 180-650× during autoregressive decoding. The compute units are idle most of the time, simply waiting for model weights and activations to arrive from memory.
Interestingly, the largest model tested (gemma3n:e2b, 11GB and 5.44B parameters) achieved the highest efficiency at 58.1%. This makes sense: its roughly 4.4 GB/token memory requirement keeps the memory bus busy, so actual performance comes closest to the theoretical ceiling. The model's selective-activation design helps too: only an effective 2B of its 5.44B parameters are exercised per token, which cuts memory movement while preserving model capacity.
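A quick back-of-the-envelope check reproduces that ceiling from the bandwidth figure alone, counting just the ~2B active parameters in bfloat16 and reusing the 10% overhead factor from the methodology section:

```python
# Bandwidth-only ceiling for gemma3n:e2b, counting just the ~2B active parameters (bfloat16)
bytes_per_token = 2e9 * 2 * 1.1        # ~4.4 GB moved per decoded token
ceiling = 68e9 / bytes_per_token       # 68 GB/s memory bandwidth
print(f"{ceiling:.2f} t/s")            # ~15.45 t/s, the theoretical figure in the table
```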
### What This Means for Deployment
The performance gap between Ollama and vLLM (2.3-5.7×) tells us something important about optimization priorities for edge devices:
**Qwen 2.5 0.5B:** Ollama (Q4_K_M, 350MB) at 35.24 t/s vs vLLM (FP16, 1GB) at 15.18 t/s—2.32× faster
**Qwen 3 0.6B:** Ollama (FP8) at 38.84 t/s vs vLLM (FP8) at 12.81 t/s—3.03× faster despite identical quantization
**Gemma 3 1B:** Ollama (Q4_K_M, 815MB) at 26.33 t/s vs vLLM (FP16, 2GB) at 4.59 t/s—5.74× faster
Quantization delivers near-linear performance gains by directly attacking the memory bandwidth bottleneck. Q4_K_M quantization (4.5 bits/parameter) hits a sweet spot between model quality and speed. Going lower to INT2 might help further, but you'll need to carefully evaluate output quality.
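A weights-only estimate shows why: on a bandwidth-bound device, shrinking the bytes moved per token translates almost directly into tokens per second. The numbers below are illustrative, for a 1B-parameter model with the same 10% overhead factor as before.

```python
# Per-token memory traffic for a 1B-parameter model, weights plus 10% overhead
fp16_bytes   = 1e9 * 2.0    * 1.1   # FP16: ~2.20 GB/token
q4_k_m_bytes = 1e9 * 0.5625 * 1.1   # Q4_K_M at ~4.5 bits/parameter: ~0.62 GB/token
print(f"traffic reduction: {fp16_bytes / q4_k_m_bytes:.1f}x")   # ~3.6x less data per token
```

The measured Gemma 3 1B gap (5.74×) is larger than this traffic ratio because the backend switch matters too; the identical-quantization Qwen 3 comparison above shows roughly a 3× difference from the backend alone.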
The real insight: Ollama's edge-first design philosophy (GGUF format, streamlined execution, optimized kernels from llama.cpp) is fundamentally better aligned with single-stream, memory-constrained edge scenarios. vLLM's datacenter features—continuous batching, PagedAttention, tensor parallelism—add overhead without providing benefits on unified memory architectures serving single users.
**What you should actually do**: Stick with Ollama or TensorRT-LLM using Q4_K_M/INT4 quantized models in GGUF format. Target the 0.5-1B parameter range (under 3GB) to leave headroom for KV cache. Focus your optimization efforts on memory access patterns and bandwidth reduction. Watch for emerging techniques like INT4 AWQ, sparse attention, and quantized KV caches.
### Room for Improvement
The 26.8% average efficiency might sound terrible, but it's actually typical for edge AI devices. Datacenter GPUs hit 60-80% on optimized workloads, while edge devices commonly land in the 30-50% range due to architectural tradeoffs.
Three factors explain the gap:
1. **Architecture**: Unified memory sacrifices bandwidth for integration simplicity. The 4MB L2 cache and the 7-25W power envelope further constrain performance.
2. **Software maturity**: Edge inference frameworks lag behind their datacenter counterparts in optimization.
3. **Runtime overhead**: Quantization/dequantization operations, Python abstractions, and non-optimized kernels all add up.
That gemma3n:e2b already reaches 58.1% suggests smaller models could see 2-3× speedups through better software alone. But fundamental performance leaps will require hardware changes, specifically prioritizing memory bandwidth (200+ GB/s) over raw compute capability in future edge AI chips.
## Where to Go From Here
### Software Optimizations Worth Pursuing
- Optimize memory access patterns in attention and MLP kernels
- Implement a quantized KV cache (8-bit or lower); the sizing sketch after this list shows the scale of the savings
- Tune for small batch sizes (2-4) to improve memory bus utilization
- Overlap CPU-GPU pipeline operations to hide latency
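
To put the quantized-KV-cache item in perspective, here is a small sizing sketch. The layer count, KV-head count, head dimension, and context length are hypothetical round numbers for a roughly 1B-class model, chosen only to show the scale of the savings against a 5.2GB VRAM budget.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int, batch: int = 1) -> int:
    """KV cache size: one K and one V tensor per layer, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Hypothetical ~1B-class configuration: 24 layers, 8 KV heads, head_dim 64, 4K context
fp16 = kv_cache_bytes(24, 8, 64, 4096, 2)
int8 = kv_cache_bytes(24, 8, 64, 4096, 1)
print(f"FP16 KV cache: {fp16 / 2**20:.0f} MiB, INT8: {int8 / 2**20:.0f} MiB")
```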
### Research Directions
- Architectures with higher operational intensity (fewer memory accesses per compute operation)
- Sparse attention patterns to reduce memory movement
- On-device LoRA fine-tuning with frozen, quantized base weights
- Multi-model serving with shared base model weights
### What Hardware Designers Should Focus On
Future edge AI devices need a fundamental shift in priorities: memory bandwidth over raw compute capability. Specifically:
- 200+ GB/s memory bandwidth (3× current Jetson Orin Nano)
- HBM integration for higher bandwidth density
- 16GB+ capacity to support 7B+ parameter models
- Purpose-built INT4/INT8 accelerators with larger on-chip caches to reduce DRAM traffic
---
## References
1. Williams, S., Waterman, A., & Patterson, D. (2009). "Roofline: An Insightful Visual Performance Model for Multicore Architectures." *Communications of the ACM*, 52(4), 65-76.
2. NVIDIA Corporation. (2024). "Jetson Orin Nano Developer Kit Technical Specifications." [https://developer.nvidia.com/embedded/jetson-orin-nano-developer-kit](https://developer.nvidia.com/embedded/jetson-orin-nano-developer-kit)
3. "Jetson AI Lab Benchmarks." NVIDIA Jetson AI Lab. [https://www.jetson-ai-lab.com/benchmarks.html](https://www.jetson-ai-lab.com/benchmarks.html)
4. Gerganov, G., et al. (2023). "GGML - AI at the edge." *GitHub*. [https://github.com/ggerganov/ggml](https://github.com/ggerganov/ggml)
5. Kwon, W., et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." *Proceedings of SOSP 2023*.
6. Team, G., Mesnard, T., et al. (2025). "Gemma 3: Technical Report." *arXiv preprint arXiv:2503.19786v1*. [https://arxiv.org/html/2503.19786v1](https://arxiv.org/html/2503.19786v1)
7. Yang, A., et al. (2025). "Qwen3 Technical Report." *arXiv preprint arXiv:2505.09388*. [https://arxiv.org/pdf/2505.09388](https://arxiv.org/pdf/2505.09388)
8. DeepSeek-AI. (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." *arXiv preprint arXiv:2501.12948v1*. [https://arxiv.org/html/2501.12948v1](https://arxiv.org/html/2501.12948v1)
9. "Running LLMs with TensorRT-LLM on NVIDIA Jetson Orin Nano Super." Collabnix. [https://collabnix.com/running-llms-with-tensorrt-llm-on-nvidia-jetson-orin-nano-super/](https://collabnix.com/running-llms-with-tensorrt-llm-on-nvidia-jetson-orin-nano-super/)
10. Pope, R., et al. (2023). "Efficiently Scaling Transformer Inference." *Proceedings of MLSys 2023*.
11. Frantar, E., et al. (2023). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." *Proceedings of ICLR 2023*.
12. Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." *Proceedings of NeurIPS 2023*.
13. Lin, J., et al. (2023). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." *arXiv preprint arXiv:2306.00978*.

Binary files not shown (two new images added: 673 KiB and 374 KiB).