📚 Auto-publish: Add/update 1 blog posts

Generated on: Thu Aug 14 06:50:22 UTC 2025
Source: md-personal repository
Automated Publisher
2025-08-14 06:50:22 +00:00
---
title: "Fixing GPU Operator Pods Stuck in Init: Secure Boot, DKMS, and MOK on Proxmox + Debian"
date: 2025-08-09
draft: false
---
I hit an issue where all GPU Operator pods on one node were stuck in Init after migrating from Legacy BIOS to UEFI. The common error was NVIDIA components waiting for “toolkit-ready,” while the toolkit init container looped with:
- nvidia-smi failed to communicate with the NVIDIA driver
- modprobe nvidia → “Key was rejected by service”

That message is the tell: Secure Boot is enabled and the kernel refuses to load modules not signed by a trusted key.
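To make that check repeatable, here is a minimal sketch that looks for the exact error string in `modprobe` output. The helper name `is_key_rejected` is hypothetical, and the follow-up commands are left commented out because they need root and the real module.

```bash
# Return success if the given modprobe output contains the
# signature-rejection error quoted above.
is_key_rejected() {
  case "$1" in
    *"Key was rejected by service"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example wiring (commented out: needs root and the NVIDIA module):
# if is_key_rejected "$(sudo modprobe nvidia 2>&1)"; then
#   mokutil --sb-state            # expect "SecureBoot enabled"
#   modinfo -F signer nvidia      # who signed the module, if anyone
# fi
```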
### Environment
- Proxmox VE 8.4.9 VM (QEMU/KVM)
- Debian 12 (bookworm), kernel 6.1
- GPU: NVIDIA Tesla V100 (GV100GL)
- NVIDIA driver installed via Debian packages (nvidia-driver, nvidia-kernel-dkms)
### Root Cause
- Secure Boot enabled (verified with `mokutil --sb-state`)
- NVIDIA DKMS modules were built, but the signing key was not trusted by the UEFI shim/firmware
- VM booted via the fallback “UEFI QEMU HARDDISK” path (not shim), so the pending MOK request was never processed and no MOK screen appeared
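One way to confirm the last point is to check whether the entry named by `BootCurrent` in `efibootmgr -v` output references shim. The sketch below assumes the usual `BootCurrent:` / `BootNNNN*` output layout; `booted_via_shim` is a hypothetical helper. It reports `not-shim` for the fallback path even when `BOOTX64.EFI` is itself a copy of shim, so treat the result as a hint rather than proof.

```bash
# Read `efibootmgr -v` output on stdin and report whether the entry
# used for the current boot points at shimx64.efi.
booted_via_shim() {
  awk '
    /^BootCurrent:/ { cur = $2 }
    /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ {
      entries[substr($1, 5, 4)] = $0      # key by the 4-digit boot number
    }
    END {
      if (cur == "" || !(cur in entries)) { print "unknown"; exit }
      if (tolower(entries[cur]) ~ /shimx64\.efi/) print "shim"
      else print "not-shim"
    }'
}

# Usage on the VM:
# sudo efibootmgr -v | booted_via_shim
```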
### Strategy
Keep Secure Boot on; get modules trusted. That requires:
1) Ensure the VM boots via shim (so MOK can work)
2) Make sure DKMS signs modules with a MOK key/cert
3) Enroll that MOK into the firmware via shim's MokManager
### Step 1 — Boot via shim and persist EFI variables
In Proxmox (VM stopped):
- BIOS: OVMF (UEFI)
- Add EFI Disk (stores OVMF VARS; required for MOK)
- Machine: q35
- Enable Secure Boot (option shows only with OVMF + EFI Disk)
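
The same settings can also be applied from the Proxmox host shell. The sketch below only prints the `qm set` commands so they can be reviewed before running. `print_qm_uefi_cmds` is a hypothetical helper, and the VMID `100` and storage name `local-lvm` are placeholder assumptions for your own values.

```bash
# Print (do not run) the qm commands for an OVMF + q35 VM with an EFI
# vars disk that has the Secure Boot keys pre-enrolled.
print_qm_uefi_cmds() {
  local vmid="$1" storage="${2:-local-lvm}"   # storage name is an assumption
  printf 'qm set %s --bios ovmf --machine q35\n' "$vmid"
  printf 'qm set %s --efidisk0 %s:1,efitype=4m,pre-enrolled-keys=1\n' "$vmid" "$storage"
}

# Usage on the Proxmox host, with the VM stopped:
# print_qm_uefi_cmds 100 local-lvm
```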

Inside Debian:
- Ensure ESP is mounted at `/boot/efi`
- Install signed boot stack:
```bash
sudo apt install shim-signed grub-efi-amd64-signed efibootmgr mokutil
sudo grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=debian
sudo update-grub
```
- Create/verify a boot entry that points to shim:
```bash
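# Disk (-d) and ESP partition number (-p) are specific to this VM; check yours with `findmnt /boot/efi`
sudo efibootmgr -c -d /dev/sda -p 15 -L "debian" -l '\EFI\debian\shimx64.efi'
# Boot numbers differ per system; list them with `efibootmgr -v` and substitute below
sudo efibootmgr -o 0002,0001,0000 # make shim (0002) first
sudo efibootmgr -n 0002 # BootNext shim for the next reboot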
```
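The `-d /dev/sda -p 15` values above depend on the VM's disk layout. As a rough sketch, the disk and partition number can be derived from the device mounted at `/boot/efi`. `split_partition` is a hypothetical helper that assumes conventional device names such as `/dev/sda15` or `/dev/nvme0n1p1`.

```bash
# Split a partition device path into "<disk> <partition-number>".
split_partition() {
  local dev="$1" part disk
  part="${dev##*[!0-9]}"        # trailing digits = partition number
  disk="${dev%"$part"}"         # everything before them
  case "$disk" in
    *[0-9]p) disk="${disk%p}" ;;  # nvme0n1p1 / mmcblk0p1 use a "p" separator
  esac
  printf '%s %s\n' "$disk" "$part"
}

# Usage on the VM:
# set -- $(split_partition "$(findmnt -n -o SOURCE /boot/efi)")
# sudo efibootmgr -c -d "$1" -p "$2" -L "debian" -l '\EFI\debian\shimx64.efi'
```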
Tip: If NVRAM resets or the firmware uses the fallback path, also copy shim to the default loader location:
```bash
sudo mkdir -p /boot/efi/EFI/BOOT
sudo cp /boot/efi/EFI/debian/shimx64.efi /boot/efi/EFI/BOOT/BOOTX64.EFI
sudo cp /boot/efi/EFI/debian/{mmx64.efi,grubx64.efi} /boot/efi/EFI/BOOT/
```
### Step 2 — Make DKMS sign NVIDIA modules with a MOK
Debian's DKMS had already generated a private key at `/var/lib/dkms/mok.key`. Create a self-signed X.509 certificate for it in DER format, which is the format `mokutil --import` expects:
```bash
sudo openssl req -new -x509 \
-key /var/lib/dkms/mok.key \
-out /var/lib/dkms/mok.der \
-outform DER \
-subj "/CN=DKMS MOK/" \
-days 36500
```
Enable DKMS signing:
```bash
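# The stock framework.conf may ship these settings commented out, so also match a leading "#"
sudo sed -i -E 's|^#?[[:space:]]*mok_signing_key=.*|mok_signing_key=/var/lib/dkms/mok.key|' /etc/dkms/framework.conf
sudo sed -i -E 's|^#?[[:space:]]*mok_certificate=.*|mok_certificate=/var/lib/dkms/mok.der|' /etc/dkms/framework.conf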
```
Rebuild/install modules (signs them now):
```bash
# Confirm the registered module name and version with `dkms status` before running these
sudo dkms build "nvidia/$(modinfo -F version nvidia)" -k "$(uname -r)" --force
sudo dkms install "nvidia/$(modinfo -F version nvidia)" -k "$(uname -r)" --force
```
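Before rebooting, it can help to confirm that the rebuilt module actually carries the new signature. This sketch reads `modinfo` output and checks the `signer:` field, which should normally show the certificate's CN (`DKMS MOK` for the certificate created above). `is_signed_by` is a hypothetical helper.

```bash
# Read `modinfo <module>` output on stdin; succeed only if the signer
# field equals the expected certificate CN.
is_signed_by() {
  local expected="$1"
  awk -v want="$expected" '
    /^signer:/ {
      sub(/^signer:[ \t]*/, "")     # strip the field label and padding
      if ($0 == want) ok = 1
    }
    END { exit ok ? 0 : 1 }'
}

# Usage on the VM:
# modinfo nvidia | is_signed_by "DKMS MOK" && echo "signed with the MOK cert"
```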
### Step 3 — Enroll the MOK via shim (MokManager)
Queue the cert and set a longer prompt timeout:
```bash
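sudo mokutil --revoke-import # drop any previously queued import request
sudo mokutil --import /var/lib/dkms/mok.der # prompts for a one-time password used at the MOK screen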
sudo mokutil --timeout 30
sudo efibootmgr -n 0002 # ensure next boot goes through shim
```
Reboot and watch the VM console (not SSH). In the blue MOK UI:
- Enroll MOK → Continue → Yes → enter the password you set during `mokutil --import` → reboot

If arrow keys don't work in Proxmox noVNC:
- Use SPICE (virt-viewer), or
- From the Proxmox host, send keys:
  - `qm sendkey <VMID> down`, `qm sendkey <VMID> ret`, `qm sendkey <VMID> esc`
### Verification
```bash
sudo mokutil --test-key /var/lib/dkms/mok.der # “already enrolled”
sudo modprobe nvidia
nvidia-smi
kubectl -n gpu-operator get pods -o wide
```
Once the module loads, GPU Operator pods on that node leave Init and become Ready.
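
To spot any pods that are still stuck, a small filter over `kubectl get pods --no-headers` output can list everything that is not `Running` or `Completed`. This is a sketch that assumes the default column layout (NAME, READY, STATUS, ...); `stuck_pods` is a hypothetical helper and `<node>` is a placeholder.

```bash
# Read `kubectl get pods --no-headers` output on stdin and print the
# names of pods whose STATUS column is neither Running nor Completed.
stuck_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage:
# kubectl -n gpu-operator get pods --no-headers \
#   --field-selector spec.nodeName=<node> | stuck_pods
```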
### Key Insights
- “Key was rejected by service” during `modprobe nvidia` means Secure Boot rejected an untrusted module.
- Without shim in the boot path (or without a persistent EFI vars disk), `mokutil --import` won't surface a MOK screen.
- DKMS must sign modules with the same key and certificate that you enroll; set `mok_signing_key` and `mok_certificate` explicitly in `/etc/dkms/framework.conf` so the two match.
- If you cannot or don't want to use MOK, the pragmatic dev choice is to disable Secure Boot in OVMF. For production, prefer shim+MOK.
### References
- Proxmox Secure Boot setup (shim + MOK, EFI vars, DKMS): [Proxmox docs](https://pve.proxmox.com/wiki/Secure_Boot_Setup#Setup_instructions_for_shim_+_MOK_variant)