---
title: "Fixing GPU Operator Pods Stuck in Init: Secure Boot, DKMS, and MOK on Proxmox + Debian"
date: 2025-08-09
draft: false
---

I hit an issue where all GPU Operator pods on one node were stuck in Init after migrating from Legacy BIOS to UEFI. The common error was NVIDIA components waiting for “toolkit-ready,” while the toolkit init container looped with:

- nvidia-smi failed to communicate with the NVIDIA driver
- modprobe nvidia → “Key was rejected by service”

That message is the tell: Secure Boot is enabled and the kernel refuses to load modules not signed by a trusted key.

### Environment

- Proxmox VM (QEMU/KVM) 8.4.9
- Debian 12 (bookworm), kernel 6.1
- GPU: NVIDIA Tesla V100 (GV100GL)
- NVIDIA driver installed via Debian packages (nvidia-driver, nvidia-kernel-dkms)

### Root Cause

- Secure Boot enabled (verified with `mokutil --sb-state`)
- NVIDIA DKMS modules were built, but the signing key was not trusted by the UEFI shim/firmware
- VM booted via the fallback “UEFI QEMU HARDDISK” path (not shim), so MOK requests didn’t run; no MOK screen

### Strategy

Keep Secure Boot on; get modules trusted.
That requires:

1) Ensure the VM boots via shim (so MOK can work)
2) Make sure DKMS signs modules with a MOK key/cert
3) Enroll that MOK into the firmware via shim’s MokManager

### Step 1: Boot via shim and persist EFI variables

In Proxmox (VM stopped):

- BIOS: OVMF (UEFI)
- Add EFI Disk (stores OVMF VARS; required for MOK)
- Machine: q35
- Enable Secure Boot (option shows only with OVMF + EFI Disk)

Inside Debian:

- Ensure ESP is mounted at `/boot/efi`
- Install signed boot stack:

```bash
sudo apt install shim-signed grub-efi-amd64-signed efibootmgr mokutil
sudo grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=debian
sudo update-grub
```

- Create/verify a boot entry that points to shim:

```bash
# The disk (-d), ESP partition number (-p), and boot entry IDs below are from
# my VM; list yours with `efibootmgr -v` and adjust before running.
sudo efibootmgr -c -d /dev/sda -p 15 -L "debian" -l '\EFI\debian\shimx64.efi'
sudo efibootmgr -o 0002,0001,0000   # make shim (0002) first
sudo efibootmgr -n 0002             # BootNext shim for the next reboot
```

Tip: If NVRAM resets or the firmware takes the fallback path, copy shim to the fallback location so that path also goes through shim:

```bash
sudo mkdir -p /boot/efi/EFI/BOOT
sudo cp /boot/efi/EFI/debian/shimx64.efi /boot/efi/EFI/BOOT/BOOTX64.EFI
sudo cp /boot/efi/EFI/debian/{mmx64.efi,grubx64.efi} /boot/efi/EFI/BOOT/
```

### Step 2: Make DKMS sign NVIDIA modules with a MOK

Debian already generated a DKMS key at `/var/lib/dkms/mok.key`.
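That key may not be present on every install, so it can be worth checking before building a certificate from it. A minimal sketch, assuming the path above; `have_dkms_key` is a hypothetical helper name, not something DKMS ships:

```bash
# Hypothetical helper (not part of DKMS): succeeds only if the key file
# exists and is non-empty. Defaults to the path used in this post.
have_dkms_key() {
  local key="${1:-/var/lib/dkms/mok.key}"
  [ -s "$key" ]
}

# Usage (from a root shell if /var/lib/dkms is not accessible to your user):
#   have_dkms_key && echo "key present" || echo "key missing"
```

If the check fails, the certificate step that follows has nothing to sign with, so sort out the key first.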
Create an X.509 cert in DER format:

```bash
sudo openssl req -new -x509 \
  -key /var/lib/dkms/mok.key \
  -out /var/lib/dkms/mok.der \
  -outform DER \
  -subj "/CN=DKMS MOK/" \
  -days 36500
```

Enable DKMS signing:

```bash
# The pattern also matches the settings when they are still commented out.
sudo sed -i 's|^#\? *mok_signing_key=.*|mok_signing_key=/var/lib/dkms/mok.key|' /etc/dkms/framework.conf
sudo sed -i 's|^#\? *mok_certificate=.*|mok_certificate=/var/lib/dkms/mok.der|' /etc/dkms/framework.conf
```

Rebuild/install modules (signs them now):

```bash
# If `dkms status` lists the driver under another name (for example
# `nvidia-current`), use that name in place of `nvidia` below.
sudo dkms build "nvidia/$(modinfo -F version nvidia)" -k "$(uname -r)" --force
sudo dkms install "nvidia/$(modinfo -F version nvidia)" -k "$(uname -r)" --force
```

### Step 3: Enroll the MOK via shim (MokManager)

Queue the cert and set a longer prompt timeout:

```bash
sudo mokutil --revoke-import                  # clear any stale pending request
sudo mokutil --import /var/lib/dkms/mok.der   # asks you to set a one-time password
sudo mokutil --timeout 30
sudo efibootmgr -n 0002                       # ensure next boot goes through shim
```

Reboot to the VM console (not SSH). In the blue MOK UI:

- Enroll MOK → Continue → Yes → enter the password set during `mokutil --import` → reboot

If arrow keys don’t work in Proxmox noVNC:

- Use SPICE (virt-viewer), or
- From the Proxmox host, send keys with `qm sendkey <vmid> <key>`:
  - `qm sendkey <vmid> down`, `qm sendkey <vmid> ret`, `qm sendkey <vmid> esc`

### Verification

```bash
sudo mokutil --test-key /var/lib/dkms/mok.der   # expect: "... is already enrolled"
sudo modprobe nvidia
nvidia-smi
kubectl -n gpu-operator get pods -o wide
```

Once the module loads, GPU Operator pods on that node leave Init and become Ready.

### Key Insights

- “Key was rejected by service” during `modprobe nvidia` means Secure Boot rejected an untrusted module.
- Without shim in the boot path (or without a persistent EFI vars disk), `mokutil --import` won’t surface a MOK screen.
- DKMS will not sign modules unless configured; set `mok_signing_key` and `mok_certificate` in `/etc/dkms/framework.conf`.
- If you cannot or don’t want to use MOK, the pragmatic dev choice is to disable Secure Boot in OVMF. For production, prefer shim+MOK.
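The checks used throughout this post can be folded into one small triage helper. The sketch below is illustrative only: `next_step` and the tokens it prints are made-up names, and it assumes the module was signed with the same key being tested. It takes command output as plain strings so the logic can be tried without a GPU node.

```bash
# Hypothetical triage helper: maps three command outputs to the step to revisit.
# The function name and the printed tokens are illustrative, not from any tool.
next_step() {
  local sb_state="$1"   # output of: mokutil --sb-state
  local signer="$2"     # output of: modinfo -F signer nvidia (empty if unsigned)
  local key_test="$3"   # output of: mokutil --test-key /var/lib/dkms/mok.der

  case "$sb_state" in
    *"SecureBoot enabled"*) ;;                  # keep checking
    *) echo "secure-boot-off"; return 0 ;;      # also hit when the output is empty
  esac

  if [ -z "$signer" ]; then
    echo "sign-modules"                         # Step 2: configure DKMS signing, rebuild
    return 0
  fi

  case "$key_test" in
    *"already enrolled"*) echo "trusted" ;;     # modprobe should no longer be rejected
    *) echo "enroll-mok" ;;                     # Step 3: mokutil --import, then MokManager
  esac
}

# Usage on the affected node (from a root shell):
#   next_step "$(mokutil --sb-state 2>/dev/null)" \
#             "$(modinfo -F signer nvidia 2>/dev/null)" \
#             "$(mokutil --test-key /var/lib/dkms/mok.der 2>/dev/null)"
```

A result of `secure-boot-off` means module signing is not what is blocking the driver, so the cause lies elsewhere.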
### References

- Proxmox Secure Boot setup (shim + MOK, EFI vars, DKMS): [Proxmox docs](https://pve.proxmox.com/wiki/Secure_Boot_Setup#Setup_instructions_for_shim_+_MOK_variant)