Files
2026-01-22 06:48:23 +00:00

47 lines
23 KiB
HTML

<!doctype html><html lang=en><head><title>Why Your "Resilient" Homelab is Slower Than a Raspberry Pi · Eric X. Liu's Personal Page</title><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1"><meta name=color-scheme content="light dark"><meta http-equiv=Content-Security-Policy content="upgrade-insecure-requests; block-all-mixed-content; default-src 'self'; child-src 'self'; font-src 'self' https://fonts.gstatic.com https://cdn.jsdelivr.net/; form-action 'self'; frame-src 'self' https://www.youtube.com https://disqus.com; img-src 'self' https://referrer.disqus.com https://c.disquscdn.com https://*.disqus.com; object-src 'none'; style-src 'self' 'unsafe-inline' https://fonts.googleapis.com/ https://cdn.jsdelivr.net/; script-src 'self' 'unsafe-inline' https://www.google-analytics.com https://cdn.jsdelivr.net/ https://pagead2.googlesyndication.com https://static.cloudflareinsights.com https://unpkg.com https://ericxliu-me.disqus.com https://disqus.com https://*.disqus.com https://*.disquscdn.com https://unpkg.com; connect-src 'self' https://www.google-analytics.com https://pagead2.googlesyndication.com https://cloudflareinsights.com ws://localhost:1313 ws://localhost:* wss://localhost:* https://links.services.disqus.com https://*.disqus.com;"><meta name=author content="Eric X. Liu"><meta name=description content="In the world of self-hosting, there are many metrics for success: 99.9% uptime, sub-second latency, or a perfect GitOps pipeline. But for those of us running &ldquo;production&rdquo; at home, there is only one metric that truly matters: The Wife Acceptance Factor (WAF).
My detailed Grafana dashboards said everything was fine. But my wife said the SSO login was &ldquo;slow sometimes.&rdquo; She was right. Debugging it took me down a rabbit hole of connection pooling, misplaced assumptions, and the harsh reality of running databases on distributed storage."><meta name=keywords content="software engineer,performance engineering,Google engineer,tech blog,software development,performance optimization,Eric Liu,engineering blog,mountain biking,Jeep enthusiast,overlanding,camping,outdoor adventures"><meta name=twitter:card content="summary"><meta name=twitter:title content='Why Your "Resilient" Homelab is Slower Than a Raspberry Pi'><meta name=twitter:description content="In the world of self-hosting, there are many metrics for success: 99.9% uptime, sub-second latency, or a perfect GitOps pipeline. But for those of us running “production” at home, there is only one metric that truly matters: The Wife Acceptance Factor (WAF).
My detailed Grafana dashboards said everything was fine. But my wife said the SSO login was “slow sometimes.” She was right. Debugging it took me down a rabbit hole of connection pooling, misplaced assumptions, and the harsh reality of running databases on distributed storage."><meta property="og:url" content="https://ericxliu.me/posts/debugging-authentik-performance/"><meta property="og:site_name" content="Eric X. Liu's Personal Page"><meta property="og:title" content='Why Your "Resilient" Homelab is Slower Than a Raspberry Pi'><meta property="og:description" content="In the world of self-hosting, there are many metrics for success: 99.9% uptime, sub-second latency, or a perfect GitOps pipeline. But for those of us running “production” at home, there is only one metric that truly matters: The Wife Acceptance Factor (WAF).
My detailed Grafana dashboards said everything was fine. But my wife said the SSO login was “slow sometimes.” She was right. Debugging it took me down a rabbit hole of connection pooling, misplaced assumptions, and the harsh reality of running databases on distributed storage."><meta property="og:locale" content="en"><meta property="og:type" content="article"><meta property="article:section" content="posts"><meta property="article:published_time" content="2026-01-02T00:00:00+00:00"><meta property="article:modified_time" content="2026-01-03T06:57:12+00:00"><link rel=preload href=/fonts/fa-solid-900.woff2 as=font type=font/woff2 crossorigin><link rel=preload href=/fonts/fa-brands-400.woff2 as=font type=font/woff2 crossorigin><link rel=canonical href=https://ericxliu.me/posts/debugging-authentik-performance/><link rel=preload href=/fonts/fa-brands-400.woff2 as=font type=font/woff2 crossorigin><link rel=preload href=/fonts/fa-regular-400.woff2 as=font type=font/woff2 crossorigin><link rel=preload href=/fonts/fa-solid-900.woff2 as=font type=font/woff2 crossorigin><link rel=stylesheet href=/css/coder.min.4b392a85107b91dbdabc528edf014a6ab1a30cd44cafcd5325c8efe796794fca.css integrity="sha256-SzkqhRB7kdvavFKO3wFKarGjDNRMr81TJcjv55Z5T8o=" crossorigin=anonymous media=screen><link rel=stylesheet href=/css/coder-dark.min.a00e6364bacbc8266ad1cc81230774a1397198f8cfb7bcba29b7d6fcb54ce57f.css integrity="sha256-oA5jZLrLyCZq0cyBIwd0oTlxmPjPt7y6KbfW/LVM5X8=" crossorigin=anonymous media=screen><link rel=icon type=image/svg+xml href=/images/favicon.svg sizes=any><link rel=icon type=image/png href=/images/favicon-32x32.png sizes=32x32><link rel=icon type=image/png href=/images/favicon-16x16.png sizes=16x16><link rel=apple-touch-icon href=/images/apple-touch-icon.png><link rel=apple-touch-icon sizes=180x180 href=/images/apple-touch-icon.png><link rel=manifest href=/site.webmanifest><link rel=mask-icon href=/images/safari-pinned-tab.svg color=#5bbad5><script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-3972604619956476" crossorigin=anonymous></script><script type=application/ld+json>{"@context":"http://schema.org","@type":"Person","name":"Eric X. Liu","url":"https:\/\/ericxliu.me\/","description":"Software \u0026 Performance Engineer at Google","sameAs":["https:\/\/www.linkedin.com\/in\/eric-x-liu-46648b93\/","https:\/\/git.ericxliu.me\/eric"]}</script><script type=application/ld+json>{"@context":"http://schema.org","@type":"BlogPosting","headline":"Why Your \u0022Resilient\u0022 Homelab is Slower Than a Raspberry Pi","genre":"Blog","wordcount":"1031","url":"https:\/\/ericxliu.me\/posts\/debugging-authentik-performance\/","datePublished":"2026-01-02T00:00:00\u002b00:00","dateModified":"2026-01-03T06:57:12\u002b00:00","description":"\u003cp\u003eIn the world of self-hosting, there are many metrics for success: 99.9% uptime, sub-second latency, or a perfect GitOps pipeline. But for those of us running \u0026ldquo;production\u0026rdquo; at home, there is only one metric that truly matters: \u003cstrong\u003eThe Wife Acceptance Factor (WAF)\u003c\/strong\u003e.\u003c\/p\u003e\n\u003cp\u003eMy detailed Grafana dashboards said everything was fine. But my wife said the SSO login was \u0026ldquo;slow sometimes.\u0026rdquo; She was right. Debugging it took me down a rabbit hole of connection pooling, misplaced assumptions, and the harsh reality of running databases on distributed storage.\u003c\/p\u003e","author":{"@type":"Person","name":"Eric X. Liu"}}</script></head><body class="preload-transitions colorscheme-auto"><div class=float-container><a id=dark-mode-toggle class=colorscheme-toggle><i class="fa-solid fa-adjust fa-fw" aria-hidden=true></i></a></div><main class=wrapper><nav class=navigation><section class=container><a class=navigation-title href=https://ericxliu.me/>Eric X. Liu's Personal Page
</a><input type=checkbox id=menu-toggle>
<label class="menu-button float-right" for=menu-toggle><i class="fa-solid fa-bars fa-fw" aria-hidden=true></i></label><ul class=navigation-list><li class=navigation-item><a class=navigation-link href=/posts/>Posts</a></li><li class=navigation-item><a class=navigation-link href=https://chat.ericxliu.me>Chat</a></li><li class=navigation-item><a class=navigation-link href=https://git.ericxliu.me/user/oauth2/Authenitk>Git</a></li><li class=navigation-item><a class=navigation-link href=https://coder.ericxliu.me/api/v2/users/oidc/callback>Coder</a></li><li class=navigation-item><a class=navigation-link href=/about/>About</a></li><li class=navigation-item><a class=navigation-link href=/>|</a></li><li class=navigation-item><a class=navigation-link href=https://sso.ericxliu.me>Sign in</a></li></ul></section></nav><div class=content><section class="container post"><article><header><div class=post-title><h1 class=title><a class=title-link href=https://ericxliu.me/posts/debugging-authentik-performance/>Why Your "Resilient" Homelab is Slower Than a Raspberry Pi</a></h1></div><div class=post-meta><div class=date><span class=posted-on><i class="fa-solid fa-calendar" aria-hidden=true></i>
<time datetime=2026-01-02T00:00:00Z>January 2, 2026
</time></span><span class=reading-time><i class="fa-solid fa-clock" aria-hidden=true></i>
5-minute read</span></div></div></header><div class=post-content><p>In the world of self-hosting, there are many metrics for success: 99.9% uptime, sub-second latency, or a perfect GitOps pipeline. But for those of us running &ldquo;production&rdquo; at home, there is only one metric that truly matters: <strong>The Wife Acceptance Factor (WAF)</strong>.</p><p>My detailed Grafana dashboards said everything was fine. But my wife said the SSO login was &ldquo;slow sometimes.&rdquo; She was right. Debugging it took me down a rabbit hole of connection pooling, misplaced assumptions, and the harsh reality of running databases on distributed storage.</p><p>Here is a breakdown of the symptoms, the red herrings, and the root cause that was hiding in plain sight.</p><h2 id=the-environment>The Environment
<a class=heading-link href=#the-environment><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><p>My homelab is designed for node-level resilience, which adds complexity to the storage layer. It is not running on a single server, but rather a 3-node <strong>Proxmox</strong> cluster where every component is redundant:</p><ul><li><strong>Orchestration</strong>: Kubernetes (k3s) managed via Flux CD.</li><li><strong>Storage</strong>: A <strong>Ceph</strong> cluster running on the Proxmox nodes, utilizing enterprise NVMe SSDs (<code>bluestore</code>) for OSDs.</li><li><strong>Database</strong>: Postgres managed by the Zalando Postgres Operator, with persistent volumes (PVCs) provisioned on Ceph RBD (block storage).</li><li><strong>Identity</strong>: Authentik for SSO.</li></ul><p>While the underlying disks are blazing fast NVMe drives, the architecture dictates that a write to a Ceph RBD volume is not complete until it is replicated over the network and acknowledged by multiple OSDs. This setup provides incredible resilience—I can pull the plug on a node and nothing stops—but it introduces unavoidable network latency for synchronous write operations. <strong>Keep this particular trade-off in mind; it plays a starring role in the investigation later.</strong></p><h2 id=the-symptom>The Symptom
<a class=heading-link href=#the-symptom><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><p>The issue was insidious because it was intermittent. Clicking &ldquo;Login&rdquo; would sometimes hang for 5-8 seconds, while other times it was instant. To an engineer, &ldquo;sometimes slow&rdquo; is the worst kind of bug because it defies easy reproduction.</p><p>The breakthrough came when I put aside the server-side Grafana dashboards and looked at the client side. By opening Chrome DevTools and monitoring the <strong>Network</strong> tab during a slow login attempt, I was able to capture the exact failing request.</p><p>I identified the culprit: the <code>/api/v3/core/applications/</code> endpoint. It wasn&rsquo;t a connection timeout or a DNS issue; the server was simply taking 5+ seconds to respond to this specific GET request.</p><p>Armed with this &ldquo;smoking gun,&rdquo; I copied the request as cURL (preserving the session cookies) and converted it into a Python benchmark script (<code>reproduce_latency.py</code>). This allowed me to reliably trigger the latency on demand, turning an intermittent &ldquo;heisenbug&rdquo; into a reproducible test case.</p><p>The results were validating and horrifying:</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-text data-lang=text><span style=display:flex><span>Request 1: 2.1642s
</span></span><span style=display:flex><span>Request 2: 8.4321s
</span></span><span style=display:flex><span>Request 3: 5.1234s
</span></span><span style=display:flex><span>...
</span></span><span style=display:flex><span>Avg Latency: 4.8s
</span></span></code></pre></div><h2 id=investigation--red-herrings>Investigation & Red Herrings
<a class=heading-link href=#investigation--red-herrings><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><h3 id=attempt-1-the-connection-overhead-hypothesis>Attempt 1: The Connection Overhead Hypothesis
<a class=heading-link href=#attempt-1-the-connection-overhead-hypothesis><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><p><strong>The Hypothesis</strong>: Authentik defaults to <code>CONN_MAX_AGE=0</code>, meaning it closes the database connection after every request. Since I enforce SSL for the database, I assumed the handshake overhead was killing performance.</p><p><strong>The Fix Attempt</strong>: I updated the Authentik configuration to enable persistent connections:</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-yaml data-lang=yaml><span style=display:flex><span><span style=color:#7ee787>env</span>:<span style=color:#6e7681>
</span></span></span><span style=display:flex><span><span style=color:#6e7681> </span>- <span style=color:#7ee787>name</span>:<span style=color:#6e7681> </span><span style=color:#a5d6ff>AUTHENTIK_POSTGRESQL__CONN_MAX_AGE</span><span style=color:#6e7681>
</span></span></span><span style=display:flex><span><span style=color:#6e7681> </span><span style=color:#7ee787>value</span>:<span style=color:#6e7681> </span><span style=color:#a5d6ff>&#34;600&#34;</span><span style=color:#6e7681>
</span></span></span></code></pre></div><p><strong>The Reality</strong>: The benchmark showed a slight improvement (~4.2s average), but the random 5-8s spikes remained. The 300ms connection setup was a factor, but not the root cause. As a side note, enabling this without configuring TCP Keepalives caused the Authentik worker to crash with <code>OperationalError('the connection is closed')</code> when firewalls silently dropped idle connections.</p><h3 id=attempt-2-cpu-starvation>Attempt 2: CPU Starvation
<a class=heading-link href=#attempt-2-cpu-starvation><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><p><strong>The Hypothesis</strong>: The pods were CPU throttled during request processing.</p><p><strong>The Reality</strong>: <code>kubectl top pods</code> showed the server using only 29m (2.9% of a core). Even increasing the Gunicorn worker count from 2 to 4 did not improve the latency of individual requests, though it did help with concurrency.</p><h2 id=the-root-cause-a-perfect-storm>The Root Cause: A Perfect Storm
<a class=heading-link href=#the-root-cause-a-perfect-storm><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><p>I was stuck. The CPU was idle, network was fine, and individual database queries were fast (&lt;1ms). Then I looked at the traffic patterns:</p><ol><li><strong>Redis</strong>: Almost zero traffic.</li><li><strong>Postgres</strong>: High <code>WALSync</code> and <code>WALWrite</code> wait times.</li><li><strong>The Table</strong>: <code>django_postgres_cache_cacheentry</code> was getting hammered.</li></ol><h3 id=insight-the-breaking-change>Insight: The Breaking Change
<a class=heading-link href=#insight-the-breaking-change><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><p>I checked the release notes for <strong>Authentik 2025.10</strong>:</p><blockquote><p><em>Breaking Change: Redis is no longer used for caching. All caching has been moved to the PostgreSQL database to simplify deployment.</em></p></blockquote><p>This architectural shift created a bottleneck specific to my storage backend:</p><ol><li><strong>The Change</strong>: Every API request triggers a cache write (session updates) to Postgres instead of Redis.</li><li><strong>The Default</strong>: Postgres defaults to <code>synchronous_commit = on</code>. A transaction is not considered &ldquo;committed&rdquo; until it is flushed to disk.</li><li><strong>The Storage</strong>: Ceph RBD replicates data across the network to multiple OSDs.</li></ol><p>Every time I loaded the dashboard, Authentik tried to update the cache. Postgres paused, verified the write was replicated to 3 other servers over the network (WAL Sync), and <em>then</em> responded.</p><h2 id=the-solution>The Solution
<a class=heading-link href=#the-solution><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><p>I couldn&rsquo;t move the database to local NVMe without losing the failover capabilities I built the cluster for. However, for a cache-heavy workload, I could compromise on strict durability.</p><p>I patched the Postgres configuration to disable synchronous commits:</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-yaml data-lang=yaml><span style=display:flex><span><span style=color:#7ee787>spec</span>:<span style=color:#6e7681>
</span></span></span><span style=display:flex><span><span style=color:#6e7681> </span><span style=color:#7ee787>postgresql</span>:<span style=color:#6e7681>
</span></span></span><span style=display:flex><span><span style=color:#6e7681> </span><span style=color:#7ee787>parameters</span>:<span style=color:#6e7681>
</span></span></span><span style=display:flex><span><span style=color:#6e7681> </span><span style=color:#7ee787>synchronous_commit</span>:<span style=color:#6e7681> </span><span style=color:#a5d6ff>&#34;off&#34;</span><span style=color:#6e7681> </span><span style=color:#8b949e;font-style:italic># The magic switch</span><span style=color:#6e7681>
</span></span></span></code></pre></div><p><strong>What this does</strong>: Postgres returns &ldquo;Success&rdquo; to the application as soon as the transaction is in memory. It flushes to disk in the background. In the event of a crash, I might lose the last ~500ms of data (mostly cache entries), which is an acceptable trade-off.</p><h2 id=verification>Verification
<a class=heading-link href=#verification><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><p>I re-ran the benchmark with <code>synchronous_commit = off</code>.</p><table><thead><tr><th>Metric</th><th>Before (<code>sync=on</code>)</th><th>After (<code>sync=off</code>)</th><th>Improvement</th></tr></thead><tbody><tr><td>Sequential x8 stream (Avg)</td><td>~4.8s</td><td><strong>0.40s</strong></td><td><strong>12x Faster</strong></td></tr><tr><td>Parallel x8 stream (Wall)</td><td>~10.5s</td><td><strong>2.45s</strong></td><td><strong>4x Faster</strong></td></tr></tbody></table><p>The latency vanished. The login became instant.</p><h2 id=key-insights>Key Insights
<a class=heading-link href=#key-insights><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><ul><li><strong>Read Release Notes</strong>: The shift from Redis to Postgres for caching was a major architectural change that I missed during the upgrade.</li><li><strong>Storage Matters</strong>: Distributed storage (Ceph/Longhorn) handles linear writes well, but struggles with latency-sensitive, high-frequency sync operations like WAL updates.</li><li><strong>Tuning Postgres</strong>: For workloads where immediate durability is less critical than latency (like caching tables), <code>synchronous_commit = off</code> is a powerful tool.</li><li><strong>Observability</strong>: The &ldquo;Wife Test&rdquo; is a valid monitoring alert. If a user complains it&rsquo;s slow, investigate the P99 latency, not just the average.</li></ul><h3 id=references>References
<a class=heading-link href=#references><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><ul><li><a href=https://docs.goauthentik.io/releases/2025.10/ class=external-link target=_blank rel=noopener>Authentik 2025.10 Release Notes</a></li><li><a href=https://www.postgresql.org/docs/current/wal-async-commit.html class=external-link target=_blank rel=noopener>PostgreSQL Documentation: Synchronous Commit</a></li></ul></div><footer><div id=disqus_thread></div><script>window.disqus_config=function(){},function(){if(["localhost","127.0.0.1"].indexOf(window.location.hostname)!=-1){document.getElementById("disqus_thread").innerHTML="Disqus comments not available by default when the website is previewed locally.";return}var t=document,e=t.createElement("script");e.async=!0,e.src="//ericxliu-me.disqus.com/embed.js",e.setAttribute("data-timestamp",+new Date),(t.head||t.body).appendChild(e)}(),document.addEventListener("themeChanged",function(){document.readyState=="complete"&&DISQUS.reset({reload:!0,config:disqus_config})})</script></footer></article><link rel=stylesheet href=https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/katex.min.css integrity=sha384-vKruj+a13U8yHIkAyGgK1J3ArTLzrFGBbBc0tDp4ad/EyewESeXE/Iv67Aj8gKZ0 crossorigin=anonymous><script defer src=https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/katex.min.js integrity=sha384-PwRUT/YqbnEjkZO0zZxNqcxACrXe+j766U2amXcgMg5457rve2Y7I6ZJSm2A0mS4 crossorigin=anonymous></script><script defer src=https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/contrib/auto-render.min.js integrity=sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05 crossorigin=anonymous onload='renderMathInElement(document.body,{delimiters:[{left:"$$",right:"$$",display:!0},{left:"$",right:"$",display:!1},{left:"\\(",right:"\\)",display:!1},{left:"\\[",right:"\\]",display:!0}]})'></script></section></div><footer class=footer><section class=container>©
2016 -
2026
Eric X. Liu
<a href="https://git.ericxliu.me/eric/ericxliu-me/commit/6100dca">[6100dca]</a></section></footer></main><script src=/js/coder.min.6ae284be93d2d19dad1f02b0039508d9aab3180a12a06dcc71b0b0ef7825a317.js integrity="sha256-auKEvpPS0Z2tHwKwA5UI2aqzGAoSoG3McbCw73gloxc="></script><script defer src=https://static.cloudflareinsights.com/beacon.min.js data-cf-beacon='{"token": "987638e636ce4dbb932d038af74c17d1"}'></script></body></html>