📚 Auto-publish: Add/update 6 blog posts
All checks were successful
Hugo Publish CI / build-and-deploy (push) Successful in 17s
Generated on: Sat Jan 10 20:10:48 UTC 2026
Source: md-personal repository
@@ -55,7 +55,7 @@ To understand where performance hits its ceiling, I applied roofline analysis—
The roofline model works by comparing a workload's operational intensity (how many calculations you do per byte of data moved) against the device's balance point. If your operational intensity is too low, you're bottlenecked by memory bandwidth—and as we'll see, that's exactly what happens with LLM inference.
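
To make that comparison concrete, here is a minimal sketch of the roofline calculation in Python. The hardware figures (40 TOPS of INT8 compute, roughly 100 GB/s of DRAM bandwidth) and the decode operational intensity are illustrative assumptions for this class of device, not measurements from the benchmark itself.

```python
# Minimal roofline sketch. Hardware numbers below are illustrative
# placeholders for this device class, not exact benchmark inputs.

PEAK_OPS = 40e12   # advertised compute ceiling: 40 TOPS (INT8)
MEM_BW = 100e9     # approximate DRAM bandwidth in bytes/s (assumption)

# The "balance point" (ridge point): the operational intensity at which
# the compute roof and the memory-bandwidth roof intersect.
ridge_point = PEAK_OPS / MEM_BW   # ops per byte


def attainable(operational_intensity: float) -> float:
    """Attainable ops/s for a workload with the given ops-per-byte ratio."""
    return min(PEAK_OPS, MEM_BW * operational_intensity)


# Single-stream decode streams every weight byte once per token and performs
# only a couple of operations per byte read, so its operational intensity
# sits far below the ridge point -- the definition of memory-bound.
decode_oi = 2.0  # rough ops/byte for single-batch INT8 decode (assumption)
bound = "memory" if decode_oi < ridge_point else "compute"
print(f"ridge point ≈ {ridge_point:.0f} ops/byte, decode OI ≈ {decode_oi} → {bound}-bound")
```

With numbers like these the ridge point lands in the hundreds of ops per byte while single-stream decode manages only a handful, which is the roofline's way of saying the workload never gets near the compute ceiling.
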
## The Results: Speed and Efficiency
@@ -75,7 +75,7 @@ Here's how the models ranked by token generation speed:
| 7 | google/gemma-3-1b-it | vLLM | 4.59 | 1.52 | 100% |
The standout finding: quantized sub-1B models hit 25-40 tokens/second, with Ollama consistently outperforming vLLM by 2-6× thanks to aggressive quantization and edge-optimized execution. These numbers align well with independent benchmarks from NVIDIA's Jetson AI Lab (Llama 3.2 3B at 27.7 t/s, SmolLM2 at 41 t/s), confirming this is typical performance for the hardware class.
### Responsiveness: First Token Latency
@@ -103,7 +103,7 @@ When I compared actual performance against theoretical limits, the results were
| Qwen/Qwen2.5-0.5B-Instruct | 61.82 | 15.18 | 24.6% | Memory | 0.91 |
Every single model is memory-bound in this single-stream inference scenario. Average hardware efficiency sits at just 20.8%—meaning the computational units spend most of their time waiting for data rather than crunching numbers. That advertised 40 TOPS? Largely untapped when generating one token at a time for a single user.
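
Those efficiency figures come from comparing measured throughput against a memory-bound ceiling: each generated token has to stream roughly the full weight set from DRAM, so the best-case decode rate is bandwidth divided by the model's weight footprint. Here is a rough sketch of that calculation, using placeholder bandwidth and footprint numbers; the post's table uses its own measured values, so its ceilings differ.

```python
# Sketch of the memory-bound decode ceiling and the efficiency metric.
# Bandwidth and weight-footprint values are assumptions for illustration.

MEM_BW_GB_S = 100.0   # approximate DRAM bandwidth, GB/s (assumption)


def theoretical_tokens_per_s(params_billion: float, bytes_per_param: float) -> float:
    """Memory-bound ceiling: bandwidth over GB of weights streamed per token."""
    weight_gb = params_billion * bytes_per_param
    return MEM_BW_GB_S / weight_gb


def efficiency(measured_tps: float, theoretical_tps: float) -> float:
    """Fraction of the memory-bound ceiling actually achieved."""
    return measured_tps / theoretical_tps


# e.g. a ~0.5B-parameter model in FP16 (2 bytes/param), measured at ~15 tok/s:
ceiling = theoretical_tokens_per_s(0.5, 2.0)   # ≈ 100 tok/s under these assumptions
print(f"ceiling ≈ {ceiling:.0f} tok/s, efficiency ≈ {efficiency(15.18, ceiling):.0%}")
```
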
## What This Actually Means