📚 Auto-publish: Add/update 6 blog posts
All checks were successful
Hugo Publish CI / build-and-deploy (push) Successful in 58s
Generated on: Thu Jan 8 18:13:13 UTC 2026
Source: md-personal repository
@@ -55,7 +55,7 @@ To understand where performance hits its ceiling, I applied roofline analysis—
The roofline model works by comparing a workload's operational intensity (how many calculations you do per byte of data moved) against the device's balance point. If your operational intensity is too low, you're bottlenecked by memory bandwidth—and as we'll see, that's exactly what happens with LLM inference.
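
To make the balance point concrete, here is a minimal sketch of the comparison. The bandwidth figure and the roughly 1 op/byte decode intensity are illustrative assumptions, not measurements from this post:

```python
def classify(ops_per_byte: float, peak_ops: float, peak_bw: float) -> str:
    # The ridge point is the operational intensity at which compute
    # and memory traffic take equal time on this device.
    ridge = peak_ops / peak_bw
    return "compute-bound" if ops_per_byte >= ridge else "memory-bound"

# Assumed figures: ~40 TOPS of compute against ~68 GB/s of memory
# bandwidth puts the ridge point near 588 ops/byte. Single-stream LLM
# decode sits around 1-2 ops per weight byte, far below that.
print(classify(ops_per_byte=1.0, peak_ops=40e12, peak_bw=68e9))
# -> memory-bound
```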

## The Results: Speed and Efficiency
@@ -75,7 +75,7 @@ Here's how the models ranked by token generation speed:
| 7 | google/gemma-3-1b-it | vLLM | 4.59 | 1.52 | 100% |
The standout finding: quantized sub-1B models hit 25-40 tokens/second, with Ollama consistently outperforming vLLM by 2-6× thanks to aggressive quantization and edge-optimized execution. These numbers align well with independent benchmarks from NVIDIA's Jetson AI Lab (Llama 3.2 3B at 27.7 t/s, SmolLM2 at 41 t/s), confirming this is typical performance for the hardware class.

### Responsiveness: First Token Latency
@@ -103,7 +103,7 @@ When I compared actual performance against theoretical limits, the results were
| Qwen/Qwen2.5-0.5B-Instruct | 61.82 | 15.18 | 24.6% | Memory | 0.91 |
Every single model is memory-bound in this single-stream inference scenario. Average hardware efficiency sits at just 20.8%—meaning the computational units spend most of their time waiting for data rather than crunching numbers. That advertised 40 TOPS? Largely untapped when generating one token at a time for a single user.
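
If the second and third columns of the row above are the roofline ceiling and the measured throughput in tokens/second (my reading, since the excerpt omits the header row), the efficiency column follows directly from the division:

```python
ceiling_tps = 61.82    # roofline-predicted tokens/s for Qwen2.5-0.5B
measured_tps = 15.18   # observed single-stream tokens/s

print(f"{measured_tps / ceiling_tps:.1%}")  # -> 24.6%, matching the table
```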

## What This Actually Means
@@ -11,7 +11,7 @@ draft: false
Flashing NVIDIA Jetson devices remotely presents unique challenges when the host machine is virtualized. This article documents the technical challenges, failures, and eventual success of flashing a Jetson Orin Nano Super developer kit using NVIDIA SDK Manager in various virtualized environments, specifically focusing on QEMU/KVM virtual machines and LXC containers on Proxmox VE.

### The Constraint: Hypervisor-Only Infrastructure
@@ -8,7 +8,7 @@ draft: false
Large Language Models (LLMs) have demonstrated astonishing capabilities, but out-of-the-box, they are simply powerful text predictors. They don't inherently understand what makes a response helpful, harmless, or aligned with human values. The technique that has proven most effective at bridging this gap is Reinforcement Learning from Human Feedback (RLHF), and at its heart lies a powerful algorithm: Proximal Policy Optimization (PPO).
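
For readers who want the algorithmic core up front, PPO's signature piece is a clipped surrogate objective. A minimal sketch in the standard formulation (not code from this post):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # Probability ratio between the updated policy and the frozen one.
    ratio = torch.exp(logp_new - logp_old)
    # Clipping the ratio keeps any single update from moving the
    # policy too far from the one that generated the data.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```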
You may have seen diagrams like the one below, which outlines the RLHF training process. It can look intimidating, with a web of interconnected models, losses, and data flows.

This post will decode that diagram, piece by piece. We'll explore the "why" behind each component, moving from high-level concepts to the deep technical reasoning that makes this process work.
@@ -77,7 +77,7 @@ It turned out to be a syntax error in my arguments passed to the `Trainer` (or r
### Pitfall #2: Stability vs. Noise
The loss curve was initially extremely erratic, largely because the batch size my GPU could handle was limited (Physical Batch Size = 4).
**The Fix**: I implemented **Gradient Accumulation** (accumulating over 8 steps) to simulate a batch size of 32. This smoothed out the optimization landscape significantly.
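
For reference, a minimal sketch of how this is typically wired up with the Hugging Face `Trainer`. The two batch-related values match the post; everything else is a placeholder:

```python
from transformers import TrainingArguments

# Effective batch size = 4 (physical) x 8 (accumulation steps) = 32.
training_args = TrainingArguments(
    output_dir="./checkpoints",       # placeholder path
    per_device_train_batch_size=4,    # what fits in GPU memory
    gradient_accumulation_steps=8,    # sum gradients over 8 micro-batches
)
```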

### Pitfall #3: Overfitting
With a small dataset (~2k samples), overfitting is a real risk. I employed a multi-layered defense strategy:
@@ -40,7 +40,7 @@ The dimensions of the weight matrices are as follows:
### 3. Deconstructing Multi-Head Attention (MHA)
The core innovation of the Transformer is Multi-Head Attention. It allows the model to weigh the importance of different tokens in the sequence from multiple perspectives simultaneously.

#### 3.1. The "Why": Beyond a Single Attention
A single attention mechanism would force the model to average all types of linguistic relationships into one pattern. MHA avoids this by creating `h` parallel subspaces. Each "head" can specialize, with one head learning syntactic dependencies, another tracking semantic similarity, and so on. This creates a much richer representation.
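
A minimal PyTorch sketch of those mechanics: each head attends over its own `d_k`-dimensional slice of the projections, and `W_o` recombines the concatenated heads. The dimensions here are illustrative, not tied to the matrix sizes given earlier:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Sketch of MHA: h parallel attention heads over d_k-sized subspaces."""

    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        # Project, then split the model dimension into h independent heads.
        q, k, v = (W(x).view(B, T, self.h, self.d_k).transpose(1, 2)
                   for W in (self.W_q, self.W_k, self.W_v))
        scores = q @ k.transpose(-2, -1) / self.d_k**0.5   # (B, h, T, T)
        weights = scores.softmax(dim=-1)   # one attention pattern per head
        out = (weights @ v).transpose(1, 2).reshape(B, T, self.h * self.d_k)
        return self.W_o(out)   # concatenate heads and recombine
```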
@@ -55,12 +55,12 @@ The final configuration groups the individual VLANs into distinct zones, forming
* **DMZ:** Contains the `dns` and `prod` networks for semi-trusted, exposed services.
* **IoT:** Contains the `iot` network. This is a low-trust zone for smart devices.
* **Management:** Contains the `management` network. This is a highly privileged, isolated zone for network infrastructure.

#### The Security Policy Matrix
The true power of this model is realized in the firewall's zone matrix, which dictates the default traffic flow between each zone.

This matrix enforces the desired security policy with clear, high-level rules:
* **Complete IoT Isolation:** The `IoT` row shows that devices in this zone are blocked from initiating any communication with any other internal zone. Their only allowed path is out to the internet.
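
The matrix logic itself is just default-deny with explicit allows. A toy Python sketch, with zone names from the post, `WAN` standing in for the internet, and only the IoT rule encoded:

```python
# Toy model of the zone matrix: default deny, explicit allow entries.
# The real matrix carries one decision per zone pair.
ALLOWED_FLOWS = {
    ("IoT", "WAN"),  # IoT may only initiate traffic toward the internet
}

def is_allowed(src_zone: str, dst_zone: str) -> bool:
    """Anything not explicitly allowed is dropped."""
    return (src_zone, dst_zone) in ALLOWED_FLOWS

assert is_allowed("IoT", "WAN")
assert not is_allowed("IoT", "Management")  # complete internal isolation
```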