Files
ericxliu-me/posts/secure-boot-dkms-and-mok-on-proxmox-debian/index.html
2025-08-20 06:29:20 +00:00

62 lines
18 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!doctype html><html lang=en><head><title>Fixing GPU Operator Pods Stuck in Init: Secure Boot, DKMS, and MOK on Proxmox + Debian · Eric X. Liu's Personal Page</title><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1"><meta name=color-scheme content="light dark"><meta name=author content="Eric X. Liu"><meta name=description content="I hit an issue where all GPU Operator pods on one node were stuck in Init after migrating from Legacy BIOS to UEFI. The common error was NVIDIA components waiting for “toolkit-ready,” while the toolkit init container looped with:
nvidia-smi failed to communicate with the NVIDIA driver
modprobe nvidia → “Key was rejected by service”
That message is the tell: Secure Boot is enabled and the kernel refuses to load modules not signed by a trusted key."><meta name=keywords content="software engineer,performance engineering,Google engineer,tech blog,software development,performance optimization,Eric Liu,engineering blog,mountain biking,Jeep enthusiast,overlanding,camping,outdoor adventures"><meta name=twitter:card content="summary"><meta name=twitter:title content="Fixing GPU Operator Pods Stuck in Init: Secure Boot, DKMS, and MOK on Proxmox + Debian"><meta name=twitter:description content="I hit an issue where all GPU Operator pods on one node were stuck in Init after migrating from Legacy BIOS to UEFI. The common error was NVIDIA components waiting for “toolkit-ready,” while the toolkit init container looped with:
nvidia-smi failed to communicate with the NVIDIA driver modprobe nvidia → “Key was rejected by service” That message is the tell: Secure Boot is enabled and the kernel refuses to load modules not signed by a trusted key."><meta property="og:url" content="/posts/secure-boot-dkms-and-mok-on-proxmox-debian/"><meta property="og:site_name" content="Eric X. Liu's Personal Page"><meta property="og:title" content="Fixing GPU Operator Pods Stuck in Init: Secure Boot, DKMS, and MOK on Proxmox + Debian"><meta property="og:description" content="I hit an issue where all GPU Operator pods on one node were stuck in Init after migrating from Legacy BIOS to UEFI. The common error was NVIDIA components waiting for “toolkit-ready,” while the toolkit init container looped with:
nvidia-smi failed to communicate with the NVIDIA driver modprobe nvidia → “Key was rejected by service” That message is the tell: Secure Boot is enabled and the kernel refuses to load modules not signed by a trusted key."><meta property="og:locale" content="en"><meta property="og:type" content="article"><meta property="article:section" content="posts"><meta property="article:published_time" content="2025-08-09T00:00:00+00:00"><meta property="article:modified_time" content="2025-08-14T06:50:22+00:00"><link rel=canonical href=/posts/secure-boot-dkms-and-mok-on-proxmox-debian/><link rel=preload href=/fonts/fa-brands-400.woff2 as=font type=font/woff2 crossorigin><link rel=preload href=/fonts/fa-regular-400.woff2 as=font type=font/woff2 crossorigin><link rel=preload href=/fonts/fa-solid-900.woff2 as=font type=font/woff2 crossorigin><link rel=stylesheet href=/css/coder.min.6445a802b9389c9660e1b07b724dcf5718b1065ed2d71b4eeaf981cc7cc5fc46.css integrity="sha256-ZEWoArk4nJZg4bB7ck3PVxixBl7S1xtO6vmBzHzF/EY=" crossorigin=anonymous media=screen><link rel=stylesheet href=/css/coder-dark.min.a00e6364bacbc8266ad1cc81230774a1397198f8cfb7bcba29b7d6fcb54ce57f.css integrity="sha256-oA5jZLrLyCZq0cyBIwd0oTlxmPjPt7y6KbfW/LVM5X8=" crossorigin=anonymous media=screen><link rel=icon type=image/svg+xml href=/images/favicon.svg sizes=any><link rel=icon type=image/png href=/images/favicon-32x32.png sizes=32x32><link rel=icon type=image/png href=/images/favicon-16x16.png sizes=16x16><link rel=apple-touch-icon href=/images/apple-touch-icon.png><link rel=apple-touch-icon sizes=180x180 href=/images/apple-touch-icon.png><link rel=manifest href=/site.webmanifest><link rel=mask-icon href=/images/safari-pinned-tab.svg color=#5bbad5></head><body class="preload-transitions colorscheme-auto"><div class=float-container><a id=dark-mode-toggle class=colorscheme-toggle><i class="fa-solid fa-adjust fa-fw" aria-hidden=true></i></a></div><main class=wrapper><nav class=navigation><section class=container><a class=navigation-title href=/>Eric X. Liu's Personal Page
</a><input type=checkbox id=menu-toggle>
<label class="menu-button float-right" for=menu-toggle><i class="fa-solid fa-bars fa-fw" aria-hidden=true></i></label><ul class=navigation-list><li class=navigation-item><a class=navigation-link href=/posts/>Posts</a></li><li class=navigation-item><a class=navigation-link href=https://chat.ericxliu.me>Chat</a></li><li class=navigation-item><a class=navigation-link href=https://git.ericxliu.me/user/oauth2/Authenitk>Git</a></li><li class=navigation-item><a class=navigation-link href=https://coder.ericxliu.me/api/v2/users/oidc/callback>Coder</a></li><li class=navigation-item><a class=navigation-link href=/>|</a></li><li class=navigation-item><a class=navigation-link href=https://sso.ericxliu.me>Sign in</a></li></ul></section></nav><div class=content><section class="container post"><article><header><div class=post-title><h1 class=title><a class=title-link href=/posts/secure-boot-dkms-and-mok-on-proxmox-debian/>Fixing GPU Operator Pods Stuck in Init: Secure Boot, DKMS, and MOK on Proxmox + Debian</a></h1></div><div class=post-meta><div class=date><span class=posted-on><i class="fa-solid fa-calendar" aria-hidden=true></i>
<time datetime=2025-08-09T00:00:00Z>August 9, 2025
</time></span><span class=reading-time><i class="fa-solid fa-clock" aria-hidden=true></i>
3-minute read</span></div></div></header><div class=post-content><p>I hit an issue where all GPU Operator pods on one node were stuck in Init after migrating from Legacy BIOS to UEFI. The common error was NVIDIA components waiting for “toolkit-ready,” while the toolkit init container looped with:</p><ul><li>nvidia-smi failed to communicate with the NVIDIA driver</li><li>modprobe nvidia → “Key was rejected by service”</li></ul><p>That message is the tell: Secure Boot is enabled and the kernel refuses to load modules not signed by a trusted key.</p><h3 id=environment>Environment
<a class=heading-link href=#environment><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><ul><li>Proxmox VM (QEMU/KVM) 8.4.9</li><li>Debian 12 (bookworm), kernel 6.1</li><li>GPU: NVIDIA Tesla V100 (GV100GL)</li><li>NVIDIA driver installed via Debian packages (nvidia-driver, nvidia-kernel-dkms)</li></ul><h3 id=root-cause>Root Cause
<a class=heading-link href=#root-cause><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><ul><li>Secure Boot enabled (verified with <code>mokutil --sb-state</code>)</li><li>NVIDIA DKMS modules were built, but the signing key was not trusted by the UEFI shim/firmware</li><li>VM booted via the fallback “UEFI QEMU HARDDISK” path (not shim), so MOK requests didnt run; no MOK screen</li></ul><h3 id=strategy>Strategy
<a class=heading-link href=#strategy><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><p>Keep Secure Boot on; get modules trusted. That requires:</p><ol><li>Ensure the VM boots via shim (so MOK can work)</li><li>Make sure DKMS signs modules with a MOK key/cert</li><li>Enroll that MOK into the firmware via shims MokManager</li></ol><h3 id=step-1--boot-via-shim-and-persist-efi-variables>Step 1 — Boot via shim and persist EFI variables
<a class=heading-link href=#step-1--boot-via-shim-and-persist-efi-variables><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><p>In Proxmox (VM stopped):</p><ul><li>BIOS: OVMF (UEFI)</li><li>Add EFI Disk (stores OVMF VARS; required for MOK)</li><li>Machine: q35</li><li>Enable Secure Boot (option shows only with OVMF + EFI Disk)</li></ul><p>Inside Debian:</p><ul><li>Ensure ESP is mounted at <code>/boot/efi</code></li><li>Install signed boot stack:<div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-bash data-lang=bash><span style=display:flex><span>sudo apt install shim-signed grub-efi-amd64-signed efibootmgr mokutil
</span></span><span style=display:flex><span>sudo grub-install --target<span style=color:#ff7b72;font-weight:700>=</span>x86_64-efi --efi-directory<span style=color:#ff7b72;font-weight:700>=</span>/boot/efi --bootloader-id<span style=color:#ff7b72;font-weight:700>=</span>debian
</span></span><span style=display:flex><span>sudo update-grub
</span></span></code></pre></div></li><li>Create/verify a boot entry that points to shim:<div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-bash data-lang=bash><span style=display:flex><span>sudo efibootmgr -c -d /dev/sda -p <span style=color:#a5d6ff>15</span> -L <span style=color:#a5d6ff>&#34;debian&#34;</span> -l <span style=color:#a5d6ff>&#39;\EFI\debian\shimx64.efi&#39;</span>
</span></span><span style=display:flex><span>sudo efibootmgr -o 0002,0001,0000 <span style=color:#8b949e;font-style:italic># make shim (0002) first</span>
</span></span><span style=display:flex><span>sudo efibootmgr -n <span style=color:#a5d6ff>0002</span> <span style=color:#8b949e;font-style:italic># BootNext shim for the next reboot</span>
</span></span></code></pre></div></li></ul><p>Tip: If NVRAM resets or fallback path is used, copy as a fallback:</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-bash data-lang=bash><span style=display:flex><span>sudo mkdir -p /boot/efi/EFI/BOOT
</span></span><span style=display:flex><span>sudo cp /boot/efi/EFI/debian/shimx64.efi /boot/efi/EFI/BOOT/BOOTX64.EFI
</span></span><span style=display:flex><span>sudo cp /boot/efi/EFI/debian/<span style=color:#ff7b72;font-weight:700>{</span>mmx64.efi,grubx64.efi<span style=color:#ff7b72;font-weight:700>}</span> /boot/efi/EFI/BOOT/
</span></span></code></pre></div><h3 id=step-2--make-dkms-sign-nvidia-modules-with-a-mok>Step 2 — Make DKMS sign NVIDIA modules with a MOK
<a class=heading-link href=#step-2--make-dkms-sign-nvidia-modules-with-a-mok><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><p>Debian already generated a DKMS key at <code>/var/lib/dkms/mok.key</code>. Create an X.509 cert in DER format:</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-bash data-lang=bash><span style=display:flex><span>sudo openssl req -new -x509 <span style=color:#79c0ff>\
</span></span></span><span style=display:flex><span><span style=color:#79c0ff></span> -key /var/lib/dkms/mok.key <span style=color:#79c0ff>\
</span></span></span><span style=display:flex><span><span style=color:#79c0ff></span> -out /var/lib/dkms/mok.der <span style=color:#79c0ff>\
</span></span></span><span style=display:flex><span><span style=color:#79c0ff></span> -outform DER <span style=color:#79c0ff>\
</span></span></span><span style=display:flex><span><span style=color:#79c0ff></span> -subj <span style=color:#a5d6ff>&#34;/CN=DKMS MOK/&#34;</span> <span style=color:#79c0ff>\
</span></span></span><span style=display:flex><span><span style=color:#79c0ff></span> -days <span style=color:#a5d6ff>36500</span>
</span></span></code></pre></div><p>Enable DKMS signing:</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-bash data-lang=bash><span style=display:flex><span>sudo sed -i <span style=color:#a5d6ff>&#39;s|^mok_signing_key=.*|mok_signing_key=/var/lib/dkms/mok.key|&#39;</span> /etc/dkms/framework.conf
</span></span><span style=display:flex><span>sudo sed -i <span style=color:#a5d6ff>&#39;s|^mok_certificate=.*|mok_certificate=/var/lib/dkms/mok.der|&#39;</span> /etc/dkms/framework.conf
</span></span></code></pre></div><p>Rebuild/install modules (signs them now):</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-bash data-lang=bash><span style=display:flex><span>sudo dkms build nvidia/<span style=color:#ff7b72>$(</span>modinfo -F version nvidia<span style=color:#ff7b72>)</span> -k <span style=color:#ff7b72>$(</span>uname -r<span style=color:#ff7b72>)</span> --force
</span></span><span style=display:flex><span>sudo dkms install nvidia/<span style=color:#ff7b72>$(</span>modinfo -F version nvidia<span style=color:#ff7b72>)</span> -k <span style=color:#ff7b72>$(</span>uname -r<span style=color:#ff7b72>)</span> --force
</span></span></code></pre></div><h3 id=step-3--enroll-the-mok-via-shim-mokmanager>Step 3 — Enroll the MOK via shim (MokManager)
<a class=heading-link href=#step-3--enroll-the-mok-via-shim-mokmanager><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><p>Queue the cert and set a longer prompt timeout:</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-bash data-lang=bash><span style=display:flex><span>sudo mokutil --revoke-import
</span></span><span style=display:flex><span>sudo mokutil --import /var/lib/dkms/mok.der
</span></span><span style=display:flex><span>sudo mokutil --timeout <span style=color:#a5d6ff>30</span>
</span></span><span style=display:flex><span>sudo efibootmgr -n <span style=color:#a5d6ff>0002</span> <span style=color:#8b949e;font-style:italic># ensure next boot goes through shim</span>
</span></span></code></pre></div><p>Reboot to the VM console (not SSH). In the blue MOK UI:</p><ul><li>Enroll MOK → Continue → Yes → enter password → reboot</li></ul><p>If arrow keys dont work in Proxmox noVNC:</p><ul><li>Use SPICE (virt-viewer), or</li><li>From the Proxmox host, send keys:<ul><li><code>qm sendkey &lt;VMID> down</code>, <code>qm sendkey &lt;VMID> ret</code>, <code>qm sendkey &lt;VMID> esc</code></li></ul></li></ul><h3 id=verification>Verification
<a class=heading-link href=#verification><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-bash data-lang=bash><span style=display:flex><span>sudo mokutil --test-key /var/lib/dkms/mok.der <span style=color:#8b949e;font-style:italic># “already enrolled”</span>
</span></span><span style=display:flex><span>sudo modprobe nvidia
</span></span><span style=display:flex><span>nvidia-smi
</span></span><span style=display:flex><span>kubectl -n gpu-operator get pods -o wide
</span></span></code></pre></div><p>Once the module loads, GPU Operator pods on that node leave Init and become Ready.</p><h3 id=key-insights>Key Insights
<a class=heading-link href=#key-insights><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><ul><li>“Key was rejected by service” during <code>modprobe nvidia</code> means Secure Boot rejected an untrusted module.</li><li>Without shim in the boot path (or without a persistent EFI vars disk), <code>mokutil --import</code> wont surface a MOK screen.</li><li>DKMS will not sign modules unless configured; set <code>mok_signing_key</code> and <code>mok_certificate</code> in <code>/etc/dkms/framework.conf</code>.</li><li>If you cannot or dont want to use MOK, the pragmatic dev choice is to disable Secure Boot in OVMF. For production, prefer shim+MOK.</li></ul><h3 id=references>References
<a class=heading-link href=#references><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><ul><li>Proxmox Secure Boot setup (shim + MOK, EFI vars, DKMS): <a href=https://pve.proxmox.com/wiki/Secure_Boot_Setup#Setup_instructions_for_shim_+_MOK_variant class=external-link target=_blank rel=noopener>Proxmox docs</a></li></ul></div><footer><div id=disqus_thread></div><script>window.disqus_config=function(){},function(){if(["localhost","127.0.0.1"].indexOf(window.location.hostname)!=-1){document.getElementById("disqus_thread").innerHTML="Disqus comments not available by default when the website is previewed locally.";return}var t=document,e=t.createElement("script");e.async=!0,e.src="//ericxliu-me.disqus.com/embed.js",e.setAttribute("data-timestamp",+new Date),(t.head||t.body).appendChild(e)}(),document.addEventListener("themeChanged",function(){document.readyState=="complete"&&DISQUS.reset({reload:!0,config:disqus_config})})</script></footer></article><link rel=stylesheet href=https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/katex.min.css integrity=sha384-vKruj+a13U8yHIkAyGgK1J3ArTLzrFGBbBc0tDp4ad/EyewESeXE/Iv67Aj8gKZ0 crossorigin=anonymous><script defer src=https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/katex.min.js integrity=sha384-PwRUT/YqbnEjkZO0zZxNqcxACrXe+j766U2amXcgMg5457rve2Y7I6ZJSm2A0mS4 crossorigin=anonymous></script><script defer src=https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/contrib/auto-render.min.js integrity=sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05 crossorigin=anonymous onload='renderMathInElement(document.body,{delimiters:[{left:"$$",right:"$$",display:!0},{left:"$",right:"$",display:!1},{left:"\\(",right:"\\)",display:!1},{left:"\\[",right:"\\]",display:!0}]})'></script></section></div><footer class=footer><section class=container>©
2016 -
2025
Eric X. Liu
<a href="https://git.ericxliu.me/eric/ericxliu-me/commit/69c2890">[69c2890]</a></section></footer></main><script src=/js/coder.min.6ae284be93d2d19dad1f02b0039508d9aab3180a12a06dcc71b0b0ef7825a317.js integrity="sha256-auKEvpPS0Z2tHwKwA5UI2aqzGAoSoG3McbCw73gloxc="></script><script defer src=https://static.cloudflareinsights.com/beacon.min.js data-cf-beacon='{"token": "987638e636ce4dbb932d038af74c17d1"}'></script></body></html>