In the world of self-hosting, there are many metrics for success: 99.9% uptime, sub-second latency, or a perfect GitOps pipeline. But for those of us running "production" at home, there is only one metric that truly matters: The Wife Acceptance Factor (WAF).
My detailed Grafana dashboards said everything was fine. But my wife said the SSO login was "slow sometimes." She was right. Debugging it took me down a rabbit hole of connection pooling, misplaced assumptions, and the harsh reality of running databases on distributed storage.
Here is a breakdown of the symptoms, the red herrings, and the root cause that was hiding in plain sight.
The Environment
My homelab is designed for node-level resilience, which adds complexity to the storage layer. It is not running on a single server, but rather a 3-node Proxmox cluster where every component is redundant:
- Orchestration: Kubernetes (k3s) managed via Flux CD.
- Storage: A Ceph cluster running on the Proxmox nodes, utilizing enterprise NVMe SSDs (bluestore) for the OSDs.
- Database: Postgres managed by the Zalando Postgres Operator, with persistent volumes (PVCs) provisioned on Ceph RBD (block storage).
- Identity: Authentik for SSO.
While the underlying disks are blazing fast NVMe drives, the architecture dictates that a write to a Ceph RBD volume is not complete until it is replicated over the network and acknowledged by multiple OSDs. This setup provides incredible resilience—I can pull the plug on a node and nothing stops—but it introduces unavoidable network latency for synchronous write operations. Keep this particular trade-off in mind; it plays a starring role in the investigation later.
The Symptom
The issue was insidious because it was intermittent. Clicking "Login" would sometimes hang for 5-8 seconds, while other times it was instant. To an engineer, "sometimes slow" is the worst kind of bug because it defies easy reproduction.
The breakthrough came when I put aside the server-side Grafana dashboards and looked at the client side. By opening Chrome DevTools and monitoring the Network tab during a slow login attempt, I was able to capture the exact failing request.
I identified the culprit: the /api/v3/core/applications/ endpoint. It wasn't a connection timeout or a DNS issue; the server was simply taking 5+ seconds to respond to this specific GET request.
Armed with this "smoking gun," I copied the request as cURL (preserving the session cookies) and converted it into a Python benchmark script (reproduce_latency.py). This allowed me to reliably trigger the latency on demand, turning an intermittent "heisenbug" into a reproducible test case.
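For reference, here is a minimal sketch of what such a benchmark script can look like, assuming a session cookie copied from DevTools; the hostname, cookie value, and run count below are placeholders, not my exact script.

```python
# reproduce_latency.py -- minimal sketch of the benchmark described above.
# The URL, cookie, and run count are placeholders.
import statistics
import time

import requests

BASE_URL = "https://auth.example.com"        # hypothetical Authentik host
ENDPOINT = "/api/v3/core/applications/"
COOKIES = {"authentik_session": "REDACTED"}  # session cookie from DevTools
RUNS = 8

timings = []
for i in range(1, RUNS + 1):
    start = time.perf_counter()
    resp = requests.get(BASE_URL + ENDPOINT, cookies=COOKIES, timeout=30)
    elapsed = time.perf_counter() - start
    timings.append(elapsed)
    print(f"Request {i}: {elapsed:.4f}s (HTTP {resp.status_code})")

print(f"Avg Latency: {statistics.mean(timings):.2f}s")
```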
The results were validating and horrifying:
```
Request 1: 2.1642s
Request 2: 8.4321s
Request 3: 5.1234s
...
Avg Latency: 4.8s
```
Investigation & Red Herrings
Attempt 1: The Connection Overhead Hypothesis
The Hypothesis: Authentik defaults to CONN_MAX_AGE=0, meaning it closes the database connection after every request. Since I enforce SSL for the database, I assumed the handshake overhead was killing performance.
The Fix Attempt: I updated the Authentik configuration to enable persistent connections:
```yaml
env:
  - name: AUTHENTIK_POSTGRESQL__CONN_MAX_AGE
    value: "600"
```
The Reality: The benchmark showed a slight improvement (~4.2s average), but the random 5-8s spikes remained. The 300ms connection setup was a factor, but not the root cause. As a side note, enabling this without configuring TCP Keepalives caused the Authentik worker to crash with OperationalError('the connection is closed') when firewalls silently dropped idle connections.
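For context on that side note: TCP keepalives are a libpq-level setting. The sketch below shows the relevant parameters through plain psycopg2 with placeholder host and credentials; it is not the Authentik configuration itself, which exposes these knobs through its own settings.

```python
# Illustration of the libpq TCP keepalive parameters that keep long-lived
# connections from being silently dropped by firewalls. Plain psycopg2
# example with placeholder host/credentials; NOT the Authentik config.
import psycopg2

conn = psycopg2.connect(
    host="postgres.example.internal",  # hypothetical hostname
    dbname="authentik",
    user="authentik",
    password="REDACTED",
    sslmode="require",
    keepalives=1,            # enable TCP keepalives on the socket
    keepalives_idle=60,      # seconds of idle time before the first probe
    keepalives_interval=10,  # seconds between probes
    keepalives_count=3,      # failed probes before the connection is dropped
)
```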
Attempt 2: CPU Starvation
The Hypothesis: The pods were CPU throttled during request processing.
The Reality: kubectl top pods showed the server using only 29m (2.9% of a core). Even increasing the Gunicorn worker count from 2 to 4 did not improve the latency of individual requests, though it did help with concurrency.
The Root Cause: A Perfect Storm
I was stuck. The CPU was idle, the network was fine, and individual database queries were fast (<1ms). Then I looked at the traffic patterns:
- Redis: Almost zero traffic.
- Postgres: High WALSync and WALWrite wait times (see the query sketch after this list).
- The Table: django_postgres_cache_cacheentry was getting hammered.
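Those wait times came from watching Postgres activity. Here is a minimal sketch of the same check driven from Python, assuming a monitoring-capable role and placeholder connection details; run it a few times while reproducing the slow request to catch the waits in the act.

```python
# Snapshot the WAL-related wait events from pg_stat_activity.
# Connection details are placeholders.
import psycopg2

conn = psycopg2.connect(host="postgres.example.internal", dbname="postgres",
                        user="postgres", password="REDACTED")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT wait_event, count(*)
        FROM pg_stat_activity
        WHERE wait_event_type = 'IO'
          AND wait_event IN ('WALSync', 'WALWrite')
        GROUP BY wait_event
    """)
    for wait_event, backends in cur.fetchall():
        print(f"{wait_event}: {backends} backend(s) waiting")
```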
Insight: The Breaking Change
I checked the release notes for Authentik 2025.10:
> Breaking Change: Redis is no longer used for caching. All caching has been moved to the PostgreSQL database to simplify deployment.
This architectural shift created a bottleneck specific to my storage backend:
- The Change: Every API request triggers a cache write (session updates) to Postgres instead of Redis.
- The Default: Postgres defaults to synchronous_commit = on. A transaction is not considered "committed" until it is flushed to disk.
- The Storage: Ceph RBD replicates data across the network to multiple OSDs.
Every time I loaded the dashboard, Authentik tried to update its cache table. Postgres paused while the WAL flush was replicated across the network to the other nodes' OSDs (the WALSync wait), and only then responded.
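To see that effect in isolation, a rough probe is to time trivial committed writes with synchronous_commit toggled in the same session. The sketch below is illustrative, not my exact methodology; the connection details and the commit_probe scratch table are placeholders.

```python
# Rough probe: time 100 single-row commits with synchronous_commit on vs. off.
# With autocommit enabled, every INSERT is its own transaction, so each
# iteration pays the full commit (WAL flush) cost when the setting is "on".
import time
import psycopg2

conn = psycopg2.connect(host="postgres.example.internal", dbname="authentik",
                        user="postgres", password="REDACTED")
conn.autocommit = True
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS commit_probe (id serial PRIMARY KEY, ts timestamptz)")

for mode in ("on", "off"):
    cur.execute(f"SET synchronous_commit TO {mode}")
    start = time.perf_counter()
    for _ in range(100):
        cur.execute("INSERT INTO commit_probe (ts) VALUES (now())")
    per_commit_ms = (time.perf_counter() - start) * 1000 / 100
    print(f"synchronous_commit={mode}: {per_commit_ms:.2f} ms per commit")
```

On a Ceph-backed volume, the gap between the two numbers is essentially the network round trip hidden inside every WAL fsync.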
The Solution
I couldn't move the database to local NVMe without losing the failover capabilities I built the cluster for. However, for a cache-heavy workload, I could compromise on strict durability.
I patched the Postgres configuration to disable synchronous commits:
```yaml
spec:
  postgresql:
    parameters:
      synchronous_commit: "off"  # The magic switch
```
What this does: Postgres returns "Success" to the application as soon as the transaction's WAL record is in memory; the WAL writer flushes it to disk in the background shortly afterward. In the event of a crash, I might lose the last ~500ms of committed data (mostly cache entries), which is an acceptable trade-off; it risks losing recent commits, not corrupting the database.
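As a quick sanity check that the operator actually rolled the parameter out, you can ask Postgres directly; the connection details below are placeholders. (Postgres also allows scoping synchronous_commit per role or per transaction; the operator-level parameter above simply applies it cluster-wide.)

```python
# Confirm the running value of synchronous_commit after the operator applies
# the patched manifest. Connection details are placeholders.
import psycopg2

conn = psycopg2.connect(host="postgres.example.internal", dbname="authentik",
                        user="postgres", password="REDACTED")
with conn.cursor() as cur:
    cur.execute("SHOW synchronous_commit")
    print("synchronous_commit =", cur.fetchone()[0])  # expected: off
```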
Verification
I re-ran the benchmark with synchronous_commit = off.
| Metric | Before (sync=on) | After (sync=off) | Improvement |
|---|---|---|---|
| Sequential x8 stream (Avg) | ~4.8s | 0.40s | 12x Faster |
| Parallel x8 stream (Wall) | ~10.5s | 2.45s | 4x Faster |
The latency vanished. The login became instant.
Key Insights
- Read Release Notes: The shift from Redis to Postgres for caching was a major architectural change that I missed during the upgrade.
- Storage Matters: Distributed storage (Ceph/Longhorn) handles linear writes well, but struggles with latency-sensitive, high-frequency sync operations like WAL updates.
- Tuning Postgres: For workloads where immediate durability is less critical than latency (like caching tables), synchronous_commit = off is a powerful tool.
- Observability: The "Wife Test" is a valid monitoring alert. If a user complains it's slow, investigate the P99 latency, not just the average.