📚 Auto-publish: Add/update 1 blog posts
All checks were successful
Hugo Publish CI / build-and-deploy (push) Successful in 53s
Generated on: Thu Aug 14 06:50:22 UTC 2025 Source: md-personal repository
111
content/posts/secure-boot-dkms-and-mok-on-proxmox-debian.md
Normal file
@@ -0,0 +1,111 @@
---
title: "Fixing GPU Operator Pods Stuck in Init: Secure Boot, DKMS, and MOK on Proxmox + Debian"
date: 2025-08-09
draft: false
---

I hit an issue where every GPU Operator pod on one node was stuck in Init after I migrated the VM from legacy BIOS to UEFI. The NVIDIA components were all waiting for “toolkit-ready,” while the toolkit init container looped with:

- nvidia-smi failed to communicate with the NVIDIA driver
- modprobe nvidia → “Key was rejected by service”

That message is the tell: Secure Boot is enabled, and the kernel refuses to load modules that are not signed by a trusted key.

### Environment

- Proxmox VE 8.4.9 (QEMU/KVM VM)
- Debian 12 (bookworm), kernel 6.1
- GPU: NVIDIA Tesla V100 (GV100GL)
- NVIDIA driver installed via Debian packages (nvidia-driver, nvidia-kernel-dkms)

### Root Cause

- Secure Boot was enabled (verified with `mokutil --sb-state`)
- The NVIDIA DKMS modules were built, but the key that signed them was not trusted by the UEFI shim/firmware
- The VM booted via the fallback “UEFI QEMU HARDDISK” path rather than through shim, so pending MOK requests were never processed and no MOK screen appeared

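Both halves of this diagnosis can be confirmed from inside the guest before changing anything. The sketch below assumes `mokutil --sb-state` prints a line beginning with `SecureBoot enabled` when Secure Boot is active; `sb_enabled` is a hypothetical helper name, not part of any package.

```bash
# Hypothetical helper: reads `mokutil --sb-state` output on stdin and
# succeeds only if it reports Secure Boot as enabled.
sb_enabled() {
  grep -q '^SecureBoot enabled'
}
```

For example, `mokutil --sb-state | sb_enabled && modinfo nvidia | grep -i '^signer'` prints the module's signer only when Secure Boot is on. If no `signer` line shows up at all, the module is most likely unsigned.
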
### Strategy

Keep Secure Boot on; get modules trusted. That requires:

1) Ensure the VM boots via shim (so MOK can work)
2) Make sure DKMS signs modules with a MOK key/cert
3) Enroll that MOK into the firmware via shim’s MokManager

### Step 1 — Boot via shim and persist EFI variables

In Proxmox (VM stopped):

- BIOS: OVMF (UEFI)
- Add EFI Disk (stores OVMF VARS; required for MOK)
- Machine: q35
- Enable Secure Boot (option shows only with OVMF + EFI Disk)

Inside Debian:

- Ensure ESP is mounted at `/boot/efi`
- Install signed boot stack:

```bash
sudo apt install shim-signed grub-efi-amd64-signed efibootmgr mokutil
sudo grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=debian
sudo update-grub
```

- Create/verify a boot entry that points to shim:

```bash
# Disk, partition, and boot numbers below are from this VM; check yours with
# `lsblk` and `efibootmgr -v` before running.
sudo efibootmgr -c -d /dev/sda -p 15 -L "debian" -l '\EFI\debian\shimx64.efi'
sudo efibootmgr -o 0002,0001,0000   # make shim (0002) first
sudo efibootmgr -n 0002             # BootNext shim for the next reboot
```

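The boot numbers in these commands (`0002`, `0001`, `0000`) are specific to this VM. A minimal sketch for looking up the shim entry's number instead of hard-coding it, assuming `efibootmgr -v` lists each entry as a line starting with `BootXXXX*` and ending in a path such as `File(\EFI\debian\shimx64.efi)`; `shim_bootnum` is a hypothetical helper name:

```bash
# Hypothetical helper: reads `efibootmgr -v` output on stdin and prints the
# four-hex-digit number of the first boot entry whose path mentions shimx64.efi.
shim_bootnum() {
  awk '
    /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ && tolower($0) ~ /shimx64\.efi/ {
      print substr($0, 5, 4)   # the four characters after "Boot" are the number
      exit
    }'
}
```

With that in place, `sudo efibootmgr -n "$(sudo efibootmgr -v | shim_bootnum)"` sets BootNext without assuming the entry is `0002`. If the helper prints nothing, no entry points at shim yet and the `efibootmgr -c` step is still needed.
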
Tip: If the NVRAM is reset, or the firmware falls back to the default loader path, copy shim there so the fallback also goes through shim:

```bash
sudo mkdir -p /boot/efi/EFI/BOOT
sudo cp /boot/efi/EFI/debian/shimx64.efi /boot/efi/EFI/BOOT/BOOTX64.EFI
sudo cp /boot/efi/EFI/debian/{mmx64.efi,grubx64.efi} /boot/efi/EFI/BOOT/
```

### Step 2 — Make DKMS sign NVIDIA modules with a MOK

Debian already generated a DKMS key at `/var/lib/dkms/mok.key`. Create an X.509 cert in DER format:

```bash
sudo openssl req -new -x509 \
  -key /var/lib/dkms/mok.key \
  -out /var/lib/dkms/mok.der \
  -outform DER \
  -subj "/CN=DKMS MOK/" \
  -days 36500
```

Enable DKMS signing:

```bash
# `^#? *` also matches the lines if they are still commented out in framework.conf
sudo sed -i -E 's|^#? *mok_signing_key=.*|mok_signing_key=/var/lib/dkms/mok.key|' /etc/dkms/framework.conf
sudo sed -i -E 's|^#? *mok_certificate=.*|mok_certificate=/var/lib/dkms/mok.der|' /etc/dkms/framework.conf
```

Rebuild/install modules (signs them now):

```bash
# If DKMS registered the module under another name, take it from `dkms status`.
sudo dkms build "nvidia/$(modinfo -F version nvidia)" -k "$(uname -r)" --force
sudo dkms install "nvidia/$(modinfo -F version nvidia)" -k "$(uname -r)" --force
```

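After the rebuild, it is worth checking that the installed module actually carries a signature before rebooting. A small sketch, assuming `modinfo` prints a `signer:` line for signed modules and none for unsigned ones; `module_signer` is a hypothetical helper name:

```bash
# Hypothetical helper: reads `modinfo` output on stdin and prints the value of
# the "signer" field. Prints nothing if the module has no signer line.
module_signer() {
  sed -n 's/^signer:[[:space:]]*//p'
}
```

`modinfo nvidia | module_signer` should then print the name from the signing certificate (`DKMS MOK` if the certificate was created with the `-subj` value used in this step). Empty output suggests DKMS did not sign the module.
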
### Step 3 — Enroll the MOK via shim (MokManager)

Queue the cert and set a longer prompt timeout:

```bash
sudo mokutil --revoke-import                  # clear any stale pending request
sudo mokutil --import /var/lib/dkms/mok.der   # prompts for a one-time password
sudo mokutil --timeout 30                     # longer MokManager prompt timeout
sudo efibootmgr -n 0002                       # ensure next boot goes through shim
```

Reboot while watching the VM console (not SSH), because the MOK prompt appears before the OS starts. In the blue MOK UI:

- Enroll MOK → Continue → Yes → enter the password set during `mokutil --import` → reboot

If arrow keys don’t work in Proxmox noVNC:

- Use SPICE (virt-viewer), or
- From the Proxmox host, send keys:
  - `qm sendkey <VMID> down`, `qm sendkey <VMID> ret`, `qm sendkey <VMID> esc`

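If you end up driving MokManager from the Proxmox host, a short loop saves retyping the command. This is only a sketch: `send_keys` and the `KEY_DELAY` variable are hypothetical names, and the key sequence you need depends on what the MokManager screen is showing at that moment.

```bash
# Hypothetical helper: send a sequence of keys to a VM via `qm sendkey`,
# pausing between keys so the firmware UI can keep up.
# Usage: send_keys <VMID> <key> [<key> ...]
send_keys() {
  local vmid=$1 key
  shift
  for key in "$@"; do
    qm sendkey "$vmid" "$key" || return 1
    sleep "${KEY_DELAY:-1}"   # default pause of 1 second; override via KEY_DELAY
  done
}
```

For instance, `send_keys <VMID> down ret` moves the selection down one row and confirms it.
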
### Verification

```bash
sudo mokutil --test-key /var/lib/dkms/mok.der   # “already enrolled”
sudo modprobe nvidia
nvidia-smi
kubectl -n gpu-operator get pods -o wide
```

Once the module loads, GPU Operator pods on that node leave Init and become Ready.

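To see at a glance whether anything is still stuck, the pod listing can be filtered by status. A minimal sketch, assuming the default `kubectl get pods` column order (`NAME READY STATUS ...`, which `-o wide` keeps) and a header row; `pods_in_init` is a hypothetical helper name:

```bash
# Hypothetical helper: reads `kubectl get pods` output on stdin, skips the
# header row, and prints the names of pods whose STATUS starts with "Init".
pods_in_init() {
  awk 'NR > 1 && $3 ~ /^Init/ { print $1 }'
}
```

`kubectl -n gpu-operator get pods -o wide | pods_in_init` prints nothing once every pod has left Init.
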
### Key Insights

- “Key was rejected by service” during `modprobe nvidia` means Secure Boot rejected an untrusted module.
- Without shim in the boot path (or without a persistent EFI vars disk), `mokutil --import` won’t surface a MOK screen.
- DKMS will not sign modules unless configured; set `mok_signing_key` and `mok_certificate` in `/etc/dkms/framework.conf`.
- If you cannot or don’t want to use MOK, the pragmatic dev choice is to disable Secure Boot in OVMF. For production, prefer shim+MOK.

### References

- Proxmox Secure Boot setup (shim + MOK, EFI vars, DKMS): [Proxmox docs](https://pve.proxmox.com/wiki/Secure_Boot_Setup#Setup_instructions_for_shim_+_MOK_variant)