📚 Auto-publish: Add/update 1 blog posts
All checks were successful
Hugo Publish CI / build-and-deploy (push) Successful in 12s
Generated on: Sat Oct 4 20:41:50 UTC 2025 | Source: md-personal repository
NVIDIA's Jetson Orin Nano promises impressive specs: 1024 CUDA cores, 32 Tensor Cores, and 40 TOPS of INT8 compute performance packed into a compact, power-efficient edge device. On paper, it looks like a capable platform for running Large Language Models locally. But there's a catch—one that reveals a fundamental tension in modern edge AI hardware design.
After running 66 inference tests across seven different language models ranging from 0.5B to 5.4B parameters, I discovered something counterintuitive: the device's computational muscle sits largely idle during single-stream LLM inference. The bottleneck isn't computation—it's memory bandwidth. This isn't just a quirk of one device; it's a fundamental characteristic of single-user, autoregressive token generation on edge hardware—a reality that shapes how we should approach local LLM deployment.
## The Hardware: What We're Working With
### The Testing Process
Each model faced 10-12 prompts of varying complexity—from simple arithmetic to technical explanations about LLMs themselves. All tests ran with batch size = 1, simulating a single user interacting with a local chatbot—the typical edge deployment scenario. Out of 84 planned tests, 66 completed successfully (78.6% success rate). The failures? Mostly out-of-memory crashes on larger models and occasional inference engine instability.
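
For context, here is a minimal sketch of the kind of single-stream harness this setup implies, assuming a local Ollama server on its default port (recent Ollama versions report `eval_count` and `eval_duration` in the `/api/generate` response). The model tag and prompts are illustrative, not the exact test set:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def tokens_per_second(model: str, prompt: str) -> float:
    """One batch-size-1 generation; throughput from Ollama's reported timings."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = generated tokens, eval_duration = decode time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    prompts = [
        "What is 17 * 23?",                           # simple arithmetic
        "Explain how a transformer generates text.",  # technical explanation
    ]
    for p in prompts:
        print(f"{p[:40]!r}: {tokens_per_second('qwen2.5:0.5b', p):.2f} tok/s")
```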
### Understanding the Limits: Roofline Analysis

The roofline model works by comparing a workload's operational intensity (how many FLOPs it performs per byte of data moved to and from memory) against the hardware's peak compute throughput and peak memory bandwidth, which together set an upper bound on achievable performance.


## The Results: Speed and Efficiency
### What Actually Runs Fast

When I compared actual performance against the theoretical limits, the results were consistent across the board:

| Model | Theoretical Max (tok/s) | Measured (tok/s) | Efficiency | Bound By | OI (FLOPs/byte) |
|---|---|---|---|---|---|
| Qwen/Qwen3-0.6B-FP8 | 103.03 | 12.81 | 12.4% | Memory | 1.82 |
| Qwen/Qwen2.5-0.5B-Instruct | 61.82 | 15.18 | 24.6% | Memory | 0.91 |
Every single model is memory-bound in this single-stream inference scenario. Average hardware efficiency sits at just 20.8%—meaning the computational units spend most of their time waiting for data rather than crunching numbers. That advertised 40 TOPS? Largely untapped when generating one token at a time for a single user.
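
A back-of-the-envelope way to see where a theoretical ceiling like the ones in the table comes from: if every generated token has to stream the full weight set from DRAM once, then bandwidth divided by model size bounds tokens per second. This is my simplification and ignores KV cache and activation traffic:

```python
MEM_BW_GB_S = 68.0  # Jetson Orin Nano memory bandwidth

def decode_ceiling_tok_s(model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed when each token
    must read all weights from memory once."""
    return MEM_BW_GB_S / model_size_gb

print(decode_ceiling_tok_s(1.0))   # ~0.5B params in FP16 (~1 GB)    -> ~68 tok/s
print(decode_ceiling_tok_s(0.35))  # same model in Q4_K_M (~350 MB)  -> ~194 tok/s
print(decode_ceiling_tok_s(3.5))   # gemma3n:e2b quantized (~3.5 GB) -> ~19 tok/s
```

The FP16 estimate lands close to the theoretical column for Qwen2.5-0.5B above, which is the point: the ceiling is set by bytes moved, not by FLOPs available.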

## What This Actually Means
### Why Memory Bandwidth Dominates (in Single-Stream Inference)
The roofline numbers tell a clear story: operational intensity ranges from 0.91 to 3.23 FLOPs/byte across all tested models during single-token generation (batch size = 1). To actually saturate those 1024 CUDA cores and hit compute-bound operation, you'd need an operational intensity around 147 FLOPs/byte at the device's 68 GB/s memory bandwidth.
In practice, for a model to actually become compute-bound on this device during single-stream inference, it would need an operational intensity exceeding:
```
OI_threshold = Peak_Compute / Memory_Bandwidth
             = 40 TOPS / 68 GB/s
             = 588 FLOPs/byte
```
Single-stream autoregressive decoding falls 100-600× short of this threshold because each token generation requires loading the entire model from memory (matrix-vector multiplication) while performing only ~2 FLOPs per parameter. The compute units are idle most of the time, simply waiting for model weights and activations to arrive from memory.
Note: Production LLM serving with large batch sizes (32-256 requests) changes this dynamic dramatically—batching transforms matrix-vector operations into matrix-matrix multiplications, increasing operational intensity by 30-250× and making workloads compute-bound. However, edge devices serving single users cannot exploit this optimization.
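
To see both effects at once (the ~2 FLOPs per parameter of single-stream decoding, and the jump from batching), here is a rough estimate. It is my own simplification: weight reads dominate memory traffic, and KV-cache traffic is ignored:

```python
def decode_oi(bytes_per_param: float, batch_size: int = 1) -> float:
    """Approximate OI of one decode step: the full weight set is read once
    and reused across the batch, at ~2 FLOPs per parameter per sequence."""
    return (2.0 * batch_size) / bytes_per_param

print(decode_oi(2.0))                 # FP16, single stream       -> ~1 FLOP/byte
print(decode_oi(4.5 / 8))             # Q4_K_M (4.5 bits/param)   -> ~3.6 FLOPs/byte
print(decode_oi(2.0, batch_size=64))  # datacenter-style batching -> ~64 FLOPs/byte
```

Both single-stream values sit far below the 588 FLOPs/byte threshold, while batching multiplies OI by roughly the batch size, in line with the 30-250× range quoted in the note above.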
The largest model tested—gemma3n:e2b at 3.5GB quantized (5.44B total parameters, 2B effective)—shows only 16.3% efficiency, similar to the other quantized models. Even at that size, Q4_K_M quantization keeps its memory footprint manageable, giving it an operational intensity (3.23 FLOPs/byte) similar to the other INT4-quantized models. Its MatFormer architecture with selective parameter activation (only 2B of 5.44B params active per token) actually helps reduce memory traffic, though this benefit is partially offset by the overhead of routing logic.
### What This Means for Edge Deployment
The performance gap between Ollama and vLLM (2.3-5.7×) tells us something important about optimization priorities for single-user edge devices:
- **Qwen 2.5 0.5B:** Ollama (Q4_K_M, 350MB) at 35.24 t/s vs vLLM (FP16, 1GB) at 15.18 t/s—2.32× faster
- **Qwen 3 0.6B:** Ollama (FP8) at 38.84 t/s vs vLLM (FP8) at 12.81 t/s—3.03× faster despite identical quantization
- **Gemma 3 1B:** Ollama (Q4_K_M, 815MB) at 26.33 t/s vs vLLM (FP16, 2GB) at 4.59 t/s—5.74× faster
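
For the vLLM side of these comparisons, the offline API looks roughly like this; treat it as a sketch rather than the exact benchmark configuration (the dtype and sampling settings here are assumptions):

```python
from vllm import LLM, SamplingParams

# Load the FP16 checkpoint used in the comparison above
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", dtype="float16")

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain what a KV cache does in one paragraph."], params)
print(outputs[0].outputs[0].text)
```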
In single-stream scenarios, quantization delivers near-linear performance gains by directly attacking the memory bandwidth bottleneck. Q4_K_M quantization (4.5 bits/parameter) hits a sweet spot between model quality and speed. Going lower to INT2 might help further, but you'll need to carefully evaluate output quality.
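
In a memory-bound regime, the expected upper bound on that gain is simply the ratio of bytes moved per weight, which is why the observed speedups track quantization level so closely. A small sketch of that rule of thumb (it ignores dequantization overhead and quality loss):

```python
def quantization_speedup(bits_baseline: float, bits_quantized: float) -> float:
    """Memory-bound decode: speedup is bounded by the reduction in bytes per weight."""
    return bits_baseline / bits_quantized

print(quantization_speedup(16, 4.5))  # FP16 -> Q4_K_M: ~3.6x ceiling
print(quantization_speedup(16, 8))    # FP16 -> FP8/INT8: ~2x ceiling
print(quantization_speedup(16, 2))    # FP16 -> INT2: ~8x ceiling, if quality holds
```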
The real insight: Ollama's edge-first design philosophy (GGUF format, streamlined execution, optimized kernels from llama.cpp) is fundamentally better aligned with single-stream, memory-constrained edge scenarios. vLLM's datacenter features—continuous batching, PagedAttention, tensor parallelism—add overhead without providing benefits when serving individual users on unified memory architectures. These features shine in multi-user production serving where batching can be exploited, but hurt performance in the single-stream case.
**What you should actually do**: Stick with Ollama or TensorRT-LLM using Q4_K_M/INT4 quantized models in GGUF format. Target the 0.5-1B parameter range (under 3GB) to leave headroom for KV cache. Focus your optimization efforts on memory access patterns and bandwidth reduction. Watch for emerging techniques like INT4 AWQ, sparse attention, and quantized KV caches.
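
On the "leave headroom for KV cache" point, a quick sizing sketch helps. The layer counts and head dimensions below are illustrative placeholders, not the measured configurations of the tested models:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: float = 2.0) -> float:
    """Keys and values: 2 tensors per layer, n_kv_heads * head_dim elements per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9

# Hypothetical ~1B-class model with grouped-query attention, FP16 cache
print(kv_cache_gb(n_layers=24, n_kv_heads=2, head_dim=64, context_len=8192))   # ~0.1 GB
print(kv_cache_gb(n_layers=24, n_kv_heads=2, head_dim=64, context_len=32768))  # ~0.4 GB
```

Small models with grouped-query attention keep the cache modest, but long contexts and FP16 caches eat into the headroom quickly, which is exactly where quantized KV caches come in.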
### Room for Improvement
The 20.8% average efficiency might sound terrible, but it's actually typical for edge AI devices running single-stream inference. Datacenter GPUs hit 60-80% efficiency on optimized workloads—but that's typically with large batch sizes that increase operational intensity. In comparable single-stream scenarios, even high-end GPUs see similar efficiency drops. Edge devices commonly land in the 15-40% range due to architectural tradeoffs and memory bandwidth constraints relative to their compute capability.
Three factors explain the gap:

The consistent 16-24% efficiency across most models suggests there's room for further gains through approaches such as:
- On-device LoRA fine-tuning with frozen, quantized base weights
- Multi-model serving with shared base model weights
### What Edge AI Hardware Designers Should Focus On
Future edge AI devices optimized for local, single-user LLM inference need a fundamental shift in priorities: memory bandwidth over raw compute capability. Specifically:
- 200+ GB/s memory bandwidth (3× current Jetson Orin Nano)
- HBM integration for higher bandwidth density