title: "Fixing GPU Operator Pods Stuck in Init: Secure Boot, DKMS, and MOK on Proxmox + Debian"
date: 2025-08-09
draft: false

I hit an issue where all GPU Operator pods on one node were stuck in Init after migrating from Legacy BIOS to UEFI. The common error was NVIDIA components waiting for “toolkit-ready,” while the toolkit init container looped with:

  • nvidia-smi failed to communicate with the NVIDIA driver
  • modprobe nvidia → “Key was rejected by service”

That message is the tell: Secure Boot is enabled and the kernel refuses to load modules not signed by a trusted key.
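To tell this failure apart from other driver problems (missing module, wrong kernel headers), the modprobe error text can be matched directly. A minimal sketch; `classify_modprobe_error` is a hypothetical helper name, not part of any tool:

```shell
# Classify a modprobe error message (hypothetical helper).
# Prints "signature" when the text contains the Secure Boot rejection
# string quoted above, otherwise "other".
classify_modprobe_error() {
    case "$1" in
        *"Key was rejected by service"*) echo "signature" ;;
        *) echo "other" ;;
    esac
}

# Usage on the affected node (needs root and the mokutil package):
#   mokutil --sb-state
#   classify_modprobe_error "$(modprobe nvidia 2>&1)"
```

If the result is "signature" and `mokutil --sb-state` reports Secure Boot as enabled, the rest of this post applies.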

Environment

  • Proxmox VM (QEMU/KVM) 8.4.9
  • Debian 12 (bookworm), kernel 6.1
  • GPU: NVIDIA Tesla V100 (GV100GL)
  • NVIDIA driver installed via Debian packages (nvidia-driver, nvidia-kernel-dkms)

Root Cause

  • Secure Boot enabled (verified with mokutil --sb-state)
  • NVIDIA DKMS modules were built, but the signing key was not trusted by the UEFI shim/firmware
  • VM booted via the fallback “UEFI QEMU HARDDISK” path (not shim), so MOK requests didn't run and no MOK screen appeared
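To check which entry the firmware actually booted, the `BootCurrent` value from `efibootmgr -v` can be matched against the entry list. A rough sketch, assuming the usual `BootCurrent:` / `BootNNNN*` output layout; `current_boot_entry` is a made-up helper name:

```shell
# Print the efibootmgr line for the entry the firmware actually booted.
# Reads `efibootmgr -v` output on stdin.
current_boot_entry() {
    awk '
        # Remember the 4-digit id of the entry used for this boot.
        /^BootCurrent:/ { cur = $2 "" }
        # Index every BootNNNN line by its 4-character id.
        /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ {
            lines[substr($0, 5, 4)] = $0
        }
        END { if (cur in lines) print lines[cur] }
    '
}

# Usage:
#   sudo efibootmgr -v | current_boot_entry
```

If the printed line does not mention `shimx64.efi`, the VM did not boot through shim and MokManager never had a chance to run.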

Strategy

Keep Secure Boot on; get modules trusted. That requires:

  1. Ensure the VM boots via shim (so MOK can work)
  2. Make sure DKMS signs modules with a MOK key/cert
  3. Enroll that MOK into the firmware via shim's MokManager

Step 1 — Boot via shim and persist EFI variables

In Proxmox (VM stopped):

  • BIOS: OVMF (UEFI)
  • Add EFI Disk (stores OVMF VARS; required for MOK)
  • Machine: q35
  • Enable Secure Boot (option shows only with OVMF + EFI Disk)
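The same settings can probably be applied from the Proxmox host shell. The sketch below only prints the `qm set` commands so they can be reviewed first. The VMID, the storage name, and the `efitype=4m,pre-enrolled-keys=1` options are my assumptions about a typical setup, not something taken from this VM's config:

```shell
# Print (not run) `qm set` commands that mirror the GUI settings above.
# $1 = VMID, $2 = storage name for the EFI disk (both supplied by the caller).
proxmox_uefi_cmds() {
    vmid=$1
    storage=$2
    echo "qm set ${vmid} --bios ovmf --machine q35"
    # pre-enrolled-keys=1 is assumed to correspond to the Secure Boot checkbox.
    echo "qm set ${vmid} --efidisk0 ${storage}:1,efitype=4m,pre-enrolled-keys=1"
}

# Usage (hypothetical VMID and storage):
#   proxmox_uefi_cmds 100 local-lvm
```

Printing instead of executing is deliberate: an EFI disk is created on the chosen storage, so it is worth checking the target before running anything.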

Inside Debian:

  • Ensure ESP is mounted at /boot/efi
  • Install signed boot stack:
    sudo apt install shim-signed grub-efi-amd64-signed efibootmgr mokutil
    sudo grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=debian
    sudo update-grub
    
  • Create/verify a boot entry that points to shim:
    sudo efibootmgr -c -d /dev/sda -p 15 -L "debian" -l '\EFI\debian\shimx64.efi'
    sudo efibootmgr -o 0002,0001,0000     # make shim (0002) first
    sudo efibootmgr -n 0002               # BootNext shim for the next reboot
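The disk (`/dev/sda`), partition number (`15`), and boot entry numbers above are specific to this VM. One way to derive the `-d` and `-p` values instead of hardcoding them is to split the device that backs `/boot/efi`. A small sketch; `split_esp_device` is a hypothetical helper and only covers `sdXN`/`vdXN` and `nvme`/`mmcblk`-style names:

```shell
# Split an ESP device path into "<disk> <partition-number>".
split_esp_device() {
    dev=$1
    # Trailing digits are the partition number.
    part=$(printf '%s\n' "$dev" | sed -E 's/.*[^0-9]([0-9]+)$/\1/')
    disk=${dev%"$part"}
    case "$disk" in
        *[0-9]p) disk=${disk%p} ;;   # strip the "p" separator (nvme0n1p1, mmcblk0p1)
    esac
    echo "$disk $part"
}

# Usage:
#   split_esp_device "$(findmnt -no SOURCE /boot/efi)"
#   # then pass the two values to: efibootmgr -c -d <disk> -p <part> ...
```

The entry numbers for `-o` and `-n` still have to be read from the `efibootmgr` output on the machine itself.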
    

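Tip: If the NVRAM resets or the firmware falls back to the removable-media path (\EFI\BOOT\BOOTX64.EFI), copy shim there so the fallback boot still goes through shim: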

sudo mkdir -p /boot/efi/EFI/BOOT
sudo cp /boot/efi/EFI/debian/shimx64.efi /boot/efi/EFI/BOOT/BOOTX64.EFI
sudo cp /boot/efi/EFI/debian/{mmx64.efi,grubx64.efi} /boot/efi/EFI/BOOT/
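A manual copy like this is not touched by later package updates, so it can drift from the shim under `EFI/debian`. A minimal sketch of a check that compares the two files; `fallback_matches_shim` is a made-up name:

```shell
# Return 0 when the fallback loader is byte-identical to Debian's shim.
# $1 = ESP mount point (defaults to /boot/efi).
fallback_matches_shim() {
    esp=${1:-/boot/efi}
    cmp -s "$esp/EFI/debian/shimx64.efi" "$esp/EFI/BOOT/BOOTX64.EFI"
}

# Usage:
#   fallback_matches_shim || echo "fallback copy is stale, re-copy shim"
```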

Step 2 — Make DKMS sign NVIDIA modules with a MOK

Debian already generated a DKMS key at /var/lib/dkms/mok.key. Create an X.509 cert in DER format:

sudo openssl req -new -x509 \
  -key /var/lib/dkms/mok.key \
  -out /var/lib/dkms/mok.der \
  -outform DER \
  -subj "/CN=DKMS MOK/" \
  -days 36500
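Before enrolling anything, it is worth confirming that the file really is a DER certificate with the expected subject. A short sketch using standard `openssl x509` options; `show_der_cert` is a hypothetical wrapper name:

```shell
# Print the subject and validity window of a DER-encoded certificate.
# Fails if the file is not valid DER.
show_der_cert() {
    openssl x509 -inform DER -in "$1" -noout -subject -dates
}

# Usage:
#   sudo sh -c '. ./this-file.sh; show_der_cert /var/lib/dkms/mok.der'
# or simply run the openssl command above directly with sudo.
```

The subject should contain the `DKMS MOK` common name set with `-subj`.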

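Enable DKMS signing. The patterns tolerate a leading "#" because framework.conf may ship these settings commented out:

sudo sed -i -E 's|^#?[[:space:]]*mok_signing_key=.*|mok_signing_key=/var/lib/dkms/mok.key|' /etc/dkms/framework.conf
sudo sed -i -E 's|^#?[[:space:]]*mok_certificate=.*|mok_certificate=/var/lib/dkms/mok.der|' /etc/dkms/framework.conf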
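Because `sed` silently does nothing when no line matches, it helps to confirm that both settings are actually active afterwards. A minimal sketch; `dkms_mok_settings` is a hypothetical helper:

```shell
# Print the active (uncommented) MOK settings from a DKMS framework.conf.
# $1 = config path (defaults to /etc/dkms/framework.conf).
dkms_mok_settings() {
    conf=${1:-/etc/dkms/framework.conf}
    grep -E '^[[:space:]]*mok_(signing_key|certificate)=' "$conf"
}

# Usage:
#   dkms_mok_settings
```

Two lines should come back, one per setting. If either is missing, add it to the file by hand.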

Rebuild/install modules (signs them now):

sudo dkms build nvidia/$(modinfo -F version nvidia) -k $(uname -r) --force
sudo dkms install nvidia/$(modinfo -F version nvidia) -k $(uname -r) --force
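To confirm the rebuilt module is signed with the new certificate, compare the signer recorded in the module against the certificate's common name. A small sketch built on `modinfo -F signer`; the helper names are made up:

```shell
# Print the signer recorded in a module's signature (empty if unsigned).
module_signer() {
    modinfo -F signer "$1"
}

# Return 0 when the module's signer equals the expected certificate CN.
# $1 = module name, $2 = expected CN.
signed_by() {
    [ "$(module_signer "$1")" = "$2" ]
}

# Usage:
#   signed_by nvidia "DKMS MOK" && echo "signed with the DKMS MOK"
```

If `dkms build` complains that the module or version is unknown, `dkms status` lists the names and versions DKMS has registered.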

Step 3 — Enroll the MOK via shim (MokManager)

Queue the cert and set a longer prompt timeout:

sudo mokutil --revoke-import
sudo mokutil --import /var/lib/dkms/mok.der
sudo mokutil --timeout 30
sudo efibootmgr -n 0002  # ensure next boot goes through shim

Reboot to the VM console (not SSH). In the blue MOK UI:

  • Enroll MOK → Continue → Yes → enter the password you set during mokutil --import → reboot

If arrow keys don't work in Proxmox noVNC:

  • Use SPICE (virt-viewer), or
  • From the Proxmox host, send keys:
    • qm sendkey <VMID> down, qm sendkey <VMID> ret, qm sendkey <VMID> esc
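Typing `qm sendkey` once per keystroke gets tedious, so a small loop on the Proxmox host can send several keys in order. A sketch only: `send_keys` and the `KEY_DELAY` variable are hypothetical, and the right key sequence depends on what the MOK screen currently shows:

```shell
# Send a sequence of keys to a VM console via `qm sendkey`.
# $1 = VMID, remaining arguments = key names (down, ret, esc, ...).
# KEY_DELAY (seconds, default 1) is the pause after each key.
send_keys() {
    vmid=$1
    shift
    for key in "$@"; do
        qm sendkey "$vmid" "$key" || return 1
        sleep "${KEY_DELAY:-1}"
    done
}

# Usage (hypothetical VMID 100): move down one entry, then confirm.
#   send_keys 100 down ret
```

Keep the console open while doing this so each keystroke can be checked against the screen.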

Verification

sudo mokutil --test-key /var/lib/dkms/mok.der   # “already enrolled”
sudo modprobe nvidia
nvidia-smi
kubectl -n gpu-operator get pods -o wide

Once the module loads, GPU Operator pods on that node leave Init and become Ready.
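To watch for stragglers, the pod listing can be filtered down to entries whose status still starts with `Init`. A minimal sketch that assumes the default `kubectl get pods` column order (NAME, READY, STATUS, ...); `pods_in_init` is a made-up name:

```shell
# List pod names whose STATUS column starts with "Init".
# Reads `kubectl get pods` output (with header line) on stdin.
pods_in_init() {
    awk 'NR > 1 && $3 ~ /^Init/ { print $1 }'
}

# Usage:
#   kubectl -n gpu-operator get pods -o wide | pods_in_init
```

Empty output means no pod in the namespace is still waiting on an init container.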

Key Insights

  • “Key was rejected by service” during modprobe nvidia means Secure Boot rejected an untrusted module.
  • Without shim in the boot path (or without a persistent EFI vars disk), mokutil --import won't surface a MOK screen.
  • DKMS will not sign modules unless configured; set mok_signing_key and mok_certificate in /etc/dkms/framework.conf.
  • If you can't or don't want to use MOK, the pragmatic choice for a dev environment is to disable Secure Boot in OVMF. For production, prefer shim + MOK.

References

  • Proxmox Secure Boot setup (shim + MOK, EFI vars, DKMS): Proxmox docs