<!doctype html><html lang=en><head><title>Why Your "Resilient" Homelab is Slower Than a Raspberry Pi · Eric X. Liu's Personal Page</title><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1"><meta name=color-scheme content="light dark"><meta http-equiv=Content-Security-Policy content="upgrade-insecure-requests; block-all-mixed-content; default-src 'self'; child-src 'self'; font-src 'self' https://fonts.gstatic.com https://cdn.jsdelivr.net/; form-action 'self'; frame-src 'self' https://www.youtube.com https://disqus.com; img-src 'self' https://referrer.disqus.com https://c.disquscdn.com https://*.disqus.com; object-src 'none'; style-src 'self' 'unsafe-inline' https://fonts.googleapis.com/ https://cdn.jsdelivr.net/; script-src 'self' 'unsafe-inline' https://www.google-analytics.com https://cdn.jsdelivr.net/ https://pagead2.googlesyndication.com https://static.cloudflareinsights.com https://unpkg.com https://ericxliu-me.disqus.com https://disqus.com https://*.disqus.com https://*.disquscdn.com https://unpkg.com; connect-src 'self' https://www.google-analytics.com https://pagead2.googlesyndication.com https://cloudflareinsights.com ws://localhost:1313 ws://localhost:* wss://localhost:* https://links.services.disqus.com https://*.disqus.com;"><meta name=author content="Eric X. Liu"><meta name=description content="In the world of self-hosting, there are many metrics for success: 99.9% uptime, sub-second latency, or a perfect GitOps pipeline. But for those of us running “production” at home, there is only one metric that truly matters: The Wife Acceptance Factor (WAF).
My detailed Grafana dashboards said everything was fine. But my wife said the SSO login was “slow sometimes.” She was right. Debugging it took me down a rabbit hole of connection pooling, misplaced assumptions, and the harsh reality of running databases on distributed storage."><meta name=keywords content="software engineer,performance engineering,Google engineer,tech blog,software development,performance optimization,Eric Liu,engineering blog,mountain biking,Jeep enthusiast,overlanding,camping,outdoor adventures"><meta name=twitter:card content="summary"><meta name=twitter:title content='Why Your "Resilient" Homelab is Slower Than a Raspberry Pi'><meta name=twitter:description content="In the world of self-hosting, there are many metrics for success: 99.9% uptime, sub-second latency, or a perfect GitOps pipeline. But for those of us running “production” at home, there is only one metric that truly matters: The Wife Acceptance Factor (WAF).
My detailed Grafana dashboards said everything was fine. But my wife said the SSO login was “slow sometimes.” She was right. Debugging it took me down a rabbit hole of connection pooling, misplaced assumptions, and the harsh reality of running databases on distributed storage."><meta property="og:url" content="https://ericxliu.me/posts/debugging-authentik-performance/"><meta property="og:site_name" content="Eric X. Liu's Personal Page"><meta property="og:title" content='Why Your "Resilient" Homelab is Slower Than a Raspberry Pi'><meta property="og:description" content="In the world of self-hosting, there are many metrics for success: 99.9% uptime, sub-second latency, or a perfect GitOps pipeline. But for those of us running “production” at home, there is only one metric that truly matters: The Wife Acceptance Factor (WAF).
My detailed Grafana dashboards said everything was fine. But my wife said the SSO login was “slow sometimes.” She was right. Debugging it took me down a rabbit hole of connection pooling, misplaced assumptions, and the harsh reality of running databases on distributed storage."><meta property="og:locale" content="en"><meta property="og:type" content="article"><meta property="article:section" content="posts"><meta property="article:published_time" content="2026-01-02T00:00:00+00:00"><meta property="article:modified_time" content="2026-01-03T06:23:35+00:00"><link rel=preload href=/fonts/fa-solid-900.woff2 as=font type=font/woff2 crossorigin><link rel=preload href=/fonts/fa-brands-400.woff2 as=font type=font/woff2 crossorigin><link rel=canonical href=https://ericxliu.me/posts/debugging-authentik-performance/><link rel=preload href=/fonts/fa-brands-400.woff2 as=font type=font/woff2 crossorigin><link rel=preload href=/fonts/fa-regular-400.woff2 as=font type=font/woff2 crossorigin><link rel=preload href=/fonts/fa-solid-900.woff2 as=font type=font/woff2 crossorigin><link rel=stylesheet href=/css/coder.min.4b392a85107b91dbdabc528edf014a6ab1a30cd44cafcd5325c8efe796794fca.css integrity="sha256-SzkqhRB7kdvavFKO3wFKarGjDNRMr81TJcjv55Z5T8o=" crossorigin=anonymous media=screen><link rel=stylesheet href=/css/coder-dark.min.a00e6364bacbc8266ad1cc81230774a1397198f8cfb7bcba29b7d6fcb54ce57f.css integrity="sha256-oA5jZLrLyCZq0cyBIwd0oTlxmPjPt7y6KbfW/LVM5X8=" crossorigin=anonymous media=screen><link rel=icon type=image/svg+xml href=/images/favicon.svg sizes=any><link rel=icon type=image/png href=/images/favicon-32x32.png sizes=32x32><link rel=icon type=image/png href=/images/favicon-16x16.png sizes=16x16><link rel=apple-touch-icon href=/images/apple-touch-icon.png><link rel=apple-touch-icon sizes=180x180 href=/images/apple-touch-icon.png><link rel=manifest href=/site.webmanifest><link rel=mask-icon href=/images/safari-pinned-tab.svg color=#5bbad5><script async 
src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-3972604619956476" crossorigin=anonymous></script><script type=application/ld+json>{"@context":"http://schema.org","@type":"Person","name":"Eric X. Liu","url":"https:\/\/ericxliu.me\/","description":"Software \u0026 Performance Engineer at Google","sameAs":["https:\/\/www.linkedin.com\/in\/eric-x-liu-46648b93\/","https:\/\/git.ericxliu.me\/eric"]}</script><script type=application/ld+json>{"@context":"http://schema.org","@type":"BlogPosting","headline":"Why Your \u0022Resilient\u0022 Homelab is Slower Than a Raspberry Pi","genre":"Blog","wordcount":"1027","url":"https:\/\/ericxliu.me\/posts\/debugging-authentik-performance\/","datePublished":"2026-01-02T00:00:00\u002b00:00","dateModified":"2026-01-03T06:23:35\u002b00:00","description":"\u003cp\u003eIn the world of self-hosting, there are many metrics for success: 99.9% uptime, sub-second latency, or a perfect GitOps pipeline. But for those of us running \u0026ldquo;production\u0026rdquo; at home, there is only one metric that truly matters: \u003cstrong\u003eThe Wife Acceptance Factor (WAF)\u003c\/strong\u003e.\u003c\/p\u003e\n\u003cp\u003eMy detailed Grafana dashboards said everything was fine. But my wife said the SSO login was \u0026ldquo;slow sometimes.\u0026rdquo; She was right. Debugging it took me down a rabbit hole of connection pooling, misplaced assumptions, and the harsh reality of running databases on distributed storage.\u003c\/p\u003e","author":{"@type":"Person","name":"Eric X. Liu"}}</script></head><body class="preload-transitions colorscheme-auto"><div class=float-container><a id=dark-mode-toggle class=colorscheme-toggle><i class="fa-solid fa-adjust fa-fw" aria-hidden=true></i></a></div><main class=wrapper><nav class=navigation><section class=container><a class=navigation-title href=https://ericxliu.me/>Eric X. Liu's Personal Page
</a><input type=checkbox id=menu-toggle>
<label class="menu-button float-right" for=menu-toggle><i class="fa-solid fa-bars fa-fw" aria-hidden=true></i></label><ul class=navigation-list><li class=navigation-item><a class=navigation-link href=/posts/>Posts</a></li><li class=navigation-item><a class=navigation-link href=https://chat.ericxliu.me>Chat</a></li><li class=navigation-item><a class=navigation-link href=https://git.ericxliu.me/user/oauth2/Authenitk>Git</a></li><li class=navigation-item><a class=navigation-link href=https://coder.ericxliu.me/api/v2/users/oidc/callback>Coder</a></li><li class=navigation-item><a class=navigation-link href=/about/>About</a></li><li class=navigation-item><a class=navigation-link href=/>|</a></li><li class=navigation-item><a class=navigation-link href=https://sso.ericxliu.me>Sign in</a></li></ul></section></nav><div class=content><section class="container post"><article><header><div class=post-title><h1 class=title><a class=title-link href=https://ericxliu.me/posts/debugging-authentik-performance/>Why Your "Resilient" Homelab is Slower Than a Raspberry Pi</a></h1></div><div class=post-meta><div class=date><span class=posted-on><i class="fa-solid fa-calendar" aria-hidden=true></i>
<time datetime=2026-01-02T00:00:00Z>January 2, 2026
</time></span><span class=reading-time><i class="fa-solid fa-clock" aria-hidden=true></i>
5-minute read</span></div></div></header><div class=post-content><p>In the world of self-hosting, there are many metrics for success: 99.9% uptime, sub-second latency, or a perfect GitOps pipeline. But for those of us running “production” at home, there is only one metric that truly matters: <strong>The Wife Acceptance Factor (WAF)</strong>.</p><p>My detailed Grafana dashboards said everything was fine. But my wife said the SSO login was “slow sometimes.” She was right. Debugging it took me down a rabbit hole of connection pooling, misplaced assumptions, and the harsh reality of running databases on distributed storage.</p><p>Here is a breakdown of the symptoms, the red herrings, and the root cause that was hiding in plain sight.</p><h2 id=the-environment>The Environment
<a class=heading-link href=#the-environment><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><p>My homelab is designed for node-level resilience, which adds complexity to the storage layer. It is not running on a single server, but rather a 3-node <strong>Proxmox</strong> cluster where every component is redundant:</p><ul><li><strong>Orchestration</strong>: Kubernetes (k3s) managed via Flux CD.</li><li><strong>Storage</strong>: A <strong>Ceph</strong> cluster running on the Proxmox nodes, utilizing enterprise NVMe SSDs (<code>bluestore</code>) for OSDs.</li><li><strong>Database</strong>: Postgres managed by the Zalando Postgres Operator, with persistent volumes (PVCs) provisioned on Ceph RBD (block storage).</li><li><strong>Identity</strong>: Authentik for SSO.</li></ul><p>While the underlying disks are blazing fast NVMe drives, the architecture dictates that a write to a Ceph RBD volume is not complete until it is replicated over the network and acknowledged by multiple OSDs. This setup provides incredible resilience—I can pull the plug on a node and nothing stops—but it introduces unavoidable network latency for synchronous write operations. <strong>Keep this particular trade-off in mind; it plays a starring role in the investigation later.</strong></p><h2 id=the-symptom>The Symptom
<a class=heading-link href=#the-symptom><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><p>The issue was insidious because it was intermittent. Clicking “Login” would sometimes hang for 5-8 seconds, while other times it was instant. To an engineer, “sometimes slow” is the worst kind of bug because it defies easy reproduction.</p><p>The breakthrough came when I put aside the server-side Grafana dashboards and looked at the client side. By opening Chrome DevTools and monitoring the <strong>Network</strong> tab during a slow login attempt, I was able to capture the exact failing request.</p><p>I identified the culprit: the <code>/api/v3/core/applications/</code> endpoint. It wasn’t a connection timeout or a DNS issue; the server was simply taking 5+ seconds to respond to this specific GET request.</p><p>Armed with this “smoking gun,” I copied the request as cURL (preserving the session cookies) and converted it into a Python benchmark script (<code>reproduce_latency.py</code>). This allowed me to reliably trigger the latency on demand, turning an intermittent “heisenbug” into a reproducible test case.</p><p>The results were validating and horrifying:</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-text data-lang=text><span style=display:flex><span>Request 1: 2.1642s
</span></span><span style=display:flex><span>Request 2: 8.4321s
</span></span><span style=display:flex><span>Request 3: 5.1234s
</span></span><span style=display:flex><span>...
</span></span><span style=display:flex><span>Avg Latency: 4.8s
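The results above came from a small script along these lines — a minimal sketch rather than the original reproduce_latency.py: the endpoint path is the one identified in DevTools, but the host and session cookie are placeholders you would paste from the "Copy as cURL" output.

```python
# Minimal sketch of the benchmark script (reproduce_latency.py).
# URL host and COOKIE are placeholders -- paste the real values from
# Chrome DevTools' "Copy as cURL" for the slow request.
import time
import urllib.request

URL = "https://sso.example.com/api/v3/core/applications/"
COOKIE = "authentik_session=PASTE_SESSION_COOKIE_HERE"


def time_request(url: str, cookie: str) -> float:
    """Time a single authenticated GET, including reading the body."""
    req = urllib.request.Request(url, headers={"Cookie": cookie})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start


def summarize(samples: list[float]) -> str:
    """Format per-request timings plus the average, like the output above."""
    lines = [f"Request {i}: {s:.4f}s" for i, s in enumerate(samples, 1)]
    lines.append(f"Avg Latency: {sum(samples) / len(samples):.1f}s")
    return "\n".join(lines)


# To trigger the latency on demand against the real endpoint:
#   print(summarize([time_request(URL, COOKIE) for _ in range(10)]))
```

It uses only the standard library, so it runs anywhere the session cookie is valid.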
</span></span></code></pre></div><h2 id=investigation--red-herrings>Investigation & Red Herrings
<a class=heading-link href=#investigation--red-herrings><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><h3 id=attempt-1-the-connection-overhead-hypothesis>Attempt 1: The Connection Overhead Hypothesis
<a class=heading-link href=#attempt-1-the-connection-overhead-hypothesis><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><p><strong>The Hypothesis</strong>: Authentik defaults to <code>CONN_MAX_AGE=0</code>, meaning it closes the database connection after every request. Since I enforce SSL for the database, I assumed the handshake overhead was killing performance.</p><p><strong>The Fix Attempt</strong>: I updated the Authentik configuration to enable persistent connections:</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-yaml data-lang=yaml><span style=display:flex><span><span style=color:#7ee787>env</span>:<span style=color:#6e7681>
</span></span></span><span style=display:flex><span><span style=color:#6e7681> </span>- <span style=color:#7ee787>name</span>:<span style=color:#6e7681> </span><span style=color:#a5d6ff>AUTHENTIK_POSTGRESQL__CONN_MAX_AGE</span><span style=color:#6e7681>
</span></span></span><span style=display:flex><span><span style=color:#6e7681> </span><span style=color:#7ee787>value</span>:<span style=color:#6e7681> </span><span style=color:#a5d6ff>"600"</span><span style=color:#6e7681>
</span></span></span></code></pre></div><p><strong>The Reality</strong>: The benchmark showed a slight improvement (~4.2s average), but the random 5-8s spikes remained. The 300ms connection setup was a factor, but not the root cause. As a side note, enabling this without configuring TCP Keepalives caused the Authentik worker to crash with <code>OperationalError('the connection is closed')</code> when firewalls silently dropped idle connections.</p><h3 id=attempt-2-cpu-starvation>Attempt 2: CPU Starvation
<a class=heading-link href=#attempt-2-cpu-starvation><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><p><strong>The Hypothesis</strong>: The pods were CPU throttled during request processing.</p><p><strong>The Reality</strong>: <code>kubectl top pods</code> showed the server using only 29m (2.9% of a core). Even increasing the Gunicorn worker count from 2 to 4 did not improve the latency of individual requests, though it did help with concurrency.</p><h2 id=the-root-cause-a-perfect-storm>The Root Cause: A Perfect Storm
<a class=heading-link href=#the-root-cause-a-perfect-storm><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><p>I was stuck. The CPU was idle, network was fine, and individual database queries were fast (<1ms). Then I looked at the traffic patterns:</p><ol><li><strong>Redis</strong>: Almost zero traffic.</li><li><strong>Postgres</strong>: High <code>WALSync</code> and <code>WALWrite</code> wait times.</li><li><strong>The Table</strong>: <code>django_postgres_cache_cacheentry</code> was getting hammered.</li></ol><h3 id=insight-the-breaking-change>Insight: The Breaking Change
<a class=heading-link href=#insight-the-breaking-change><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><p>I checked the release notes for <strong>Authentik 2025.10</strong>:</p><blockquote><p><em>Breaking Change: Redis is no longer used for caching. All caching has been moved to the PostgreSQL database to simplify deployment.</em></p></blockquote><p>This architectural shift created a bottleneck specific to my storage backend:</p><ol><li><strong>The Change</strong>: Every API request triggers a cache write (session updates) to Postgres instead of Redis.</li><li><strong>The Default</strong>: Postgres defaults to <code>synchronous_commit = on</code>. A transaction is not considered “committed” until it is flushed to disk.</li><li><strong>The Storage</strong>: Ceph RBD replicates data across the network to multiple OSDs.</li></ol><p>Every time I loaded the dashboard, Authentik tried to update the cache. Postgres paused while the write was replicated across the other nodes over the network (WAL Sync), and <em>then</em> responded.</p><h2 id=the-solution>The Solution
<a class=heading-link href=#the-solution><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><p>I couldn’t move the database to local NVMe without losing the failover capabilities I built the cluster for. However, for a cache-heavy workload, I could compromise on strict durability.</p><p>I patched the Postgres configuration to disable synchronous commits:</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-yaml data-lang=yaml><span style=display:flex><span><span style=color:#7ee787>spec</span>:<span style=color:#6e7681>
</span></span></span><span style=display:flex><span><span style=color:#6e7681> </span><span style=color:#7ee787>postgresql</span>:<span style=color:#6e7681>
</span></span></span><span style=display:flex><span><span style=color:#6e7681> </span><span style=color:#7ee787>parameters</span>:<span style=color:#6e7681>
</span></span></span><span style=display:flex><span><span style=color:#6e7681> </span><span style=color:#7ee787>synchronous_commit</span>:<span style=color:#6e7681> </span><span style=color:#a5d6ff>"off"</span><span style=color:#6e7681> </span><span style=color:#8b949e;font-style:italic># The magic switch</span><span style=color:#6e7681>
</span></span></span></code></pre></div><p><strong>What this does</strong>: Postgres returns “Success” to the application as soon as the transaction is in memory. It flushes to disk in the background. In the event of a crash, I might lose the last ~500ms of data (mostly cache entries), which is an acceptable trade-off.</p><h2 id=verification>Verification
<a class=heading-link href=#verification><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><p>I re-ran the benchmark with <code>synchronous_commit = off</code>.</p><table><thead><tr><th>Metric</th><th>Before (<code>sync=on</code>)</th><th>After (<code>sync=off</code>)</th><th>Improvement</th></tr></thead><tbody><tr><td>Sequential (Avg)</td><td>~4.8s</td><td><strong>0.40s</strong></td><td><strong>12x Faster</strong></td></tr><tr><td>Parallel (Wall)</td><td>~10.5s</td><td><strong>2.45s</strong></td><td><strong>4x Faster</strong></td></tr></tbody></table><p>The latency vanished. The login became instant.</p><h2 id=key-insights>Key Insights
<a class=heading-link href=#key-insights><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><ul><li><strong>Read Release Notes</strong>: The shift from Redis to Postgres for caching was a major architectural change that I missed during the upgrade.</li><li><strong>Storage Matters</strong>: Distributed storage (Ceph/Longhorn) handles linear writes well, but struggles with latency-sensitive, high-frequency sync operations like WAL updates.</li><li><strong>Tuning Postgres</strong>: For workloads where immediate durability is less critical than latency (like caching tables), <code>synchronous_commit = off</code> is a powerful tool.</li><li><strong>Observability</strong>: The “Wife Test” is a valid monitoring alert. If a user complains it’s slow, investigate the P99 latency, not just the average.</li></ul><h3 id=references>References
<a class=heading-link href=#references><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><ul><li><a href=https://docs.goauthentik.io/releases/2025.10/ class=external-link target=_blank rel=noopener>Authentik 2025.10 Release Notes</a></li><li><a href=https://www.postgresql.org/docs/current/wal-async-commit.html class=external-link target=_blank rel=noopener>PostgreSQL Documentation: Synchronous Commit</a></li></ul></div><footer><div id=disqus_thread></div><script>window.disqus_config=function(){},function(){if(["localhost","127.0.0.1"].indexOf(window.location.hostname)!=-1){document.getElementById("disqus_thread").innerHTML="Disqus comments not available by default when the website is previewed locally.";return}var t=document,e=t.createElement("script");e.async=!0,e.src="//ericxliu-me.disqus.com/embed.js",e.setAttribute("data-timestamp",+new Date),(t.head||t.body).appendChild(e)}(),document.addEventListener("themeChanged",function(){document.readyState=="complete"&&DISQUS.reset({reload:!0,config:disqus_config})})</script></footer></article><link rel=stylesheet href=https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/katex.min.css integrity=sha384-vKruj+a13U8yHIkAyGgK1J3ArTLzrFGBbBc0tDp4ad/EyewESeXE/Iv67Aj8gKZ0 crossorigin=anonymous><script defer src=https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/katex.min.js integrity=sha384-PwRUT/YqbnEjkZO0zZxNqcxACrXe+j766U2amXcgMg5457rve2Y7I6ZJSm2A0mS4 crossorigin=anonymous></script><script defer src=https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/contrib/auto-render.min.js integrity=sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05 crossorigin=anonymous onload='renderMathInElement(document.body,{delimiters:[{left:"$$",right:"$$",display:!0},{left:"$",right:"$",display:!1},{left:"\\(",right:"\\)",display:!1},{left:"\\[",right:"\\]",display:!0}]})'></script></section></div><footer class=footer><section class=container>©
2016 -
2026
Eric X. Liu
<a href="https://git.ericxliu.me/eric/ericxliu-me/commit/89dc118">[89dc118]</a></section></footer></main><script src=/js/coder.min.6ae284be93d2d19dad1f02b0039508d9aab3180a12a06dcc71b0b0ef7825a317.js integrity="sha256-auKEvpPS0Z2tHwKwA5UI2aqzGAoSoG3McbCw73gloxc="></script><script defer src=https://static.cloudflareinsights.com/beacon.min.js data-cf-beacon='{"token": "987638e636ce4dbb932d038af74c17d1"}'></script></body></html> |