---
title: 'Why Your "Resilient" Homelab is Slower Than a Raspberry Pi'
date: 2026-01-02
draft: false
---

In the world of self-hosting, there are many metrics for success: 99.9% uptime, sub-second latency, or a perfect GitOps pipeline. But for those of us running "production" at home, there is only one metric that truly matters: **The Wife Acceptance Factor (WAF)**.

My detailed Grafana dashboards said everything was fine. But my wife said the SSO login was "slow sometimes." She was right. Debugging it took me down a rabbit hole of connection pooling, misplaced assumptions, and the harsh reality of running databases on distributed storage.

Here is a breakdown of the symptoms, the red herrings, and the root cause that was hiding in plain sight.

## The Environment

My homelab is designed for node-level resilience, which adds complexity to the storage layer. It is not running on a single server, but rather a 3-node **Proxmox** cluster where every component is redundant:

- **Orchestration**: Kubernetes (k3s) managed via Flux CD.
- **Storage**: A **Ceph** cluster running on the Proxmox nodes, utilizing enterprise NVMe SSDs (`bluestore`) for OSDs.
- **Database**: Postgres managed by the Zalando Postgres Operator, with persistent volumes (PVCs) provisioned on Ceph RBD (block storage).
- **Identity**: Authentik for SSO.

While the underlying disks are blazing fast NVMe drives, the architecture dictates that a write to a Ceph RBD volume is not complete until it is replicated over the network and acknowledged by multiple OSDs. This setup provides incredible resilience—I can pull the plug on a node and nothing stops—but it introduces unavoidable network latency for synchronous write operations. **Keep this particular trade-off in mind; it plays a starring role in the investigation later.**
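
A quick way to feel this trade-off is to time raw `fsync()` calls on the Ceph-backed volume, since that is exactly what Postgres does on every synchronous commit. A minimal probe sketch; the mount path is an assumption about your PVC layout:

```python
import os
import time

# Hypothetical path on a Ceph RBD-backed volume.
PATH = "/data/fsync_probe.bin"

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o600)
samples = []
for _ in range(100):
    start = time.perf_counter()
    os.write(fd, b"x" * 4096)  # one 4 KiB page, roughly one WAL write
    os.fsync(fd)               # blocks until Ceph has replicated and acked
    samples.append(time.perf_counter() - start)
os.close(fd)

samples.sort()
print(f"median: {samples[49] * 1000:.2f} ms  p99: {samples[98] * 1000:.2f} ms")
```

On local NVMe this typically lands well under a millisecond; on replicated network storage it is often an order of magnitude higher.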

## The Symptom

The issue was insidious because it was intermittent. Clicking "Login" would sometimes hang for 5-8 seconds, while other times it was instant. To an engineer, "sometimes slow" is the worst kind of bug because it defies easy reproduction.

The breakthrough came when I put aside the server-side Grafana dashboards and looked at the client side. By opening Chrome DevTools and monitoring the **Network** tab during a slow login attempt, I was able to capture the exact failing request.

I identified the culprit: the `/api/v3/core/applications/` endpoint. It wasn't a connection timeout or a DNS issue; the server was simply taking 5+ seconds to respond to this specific GET request.

Armed with this "smoking gun," I copied the request as cURL (preserving the session cookies) and converted it into a Python benchmark script (`reproduce_latency.py`). This allowed me to reliably trigger the latency on demand, turning an intermittent "heisenbug" into a reproducible test case.
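
The script amounted to little more than replaying the captured request in a loop. A rough reconstruction; the host and cookie value here are placeholders, not my real values:

```python
import statistics
import time

import requests

# Placeholder host and session cookie, taken from DevTools ("Copy as cURL").
URL = "https://auth.example.com/api/v3/core/applications/"
COOKIES = {"authentik_session": "<session-cookie-from-devtools>"}

timings = []
for i in range(10):
    start = time.perf_counter()
    resp = requests.get(URL, cookies=COOKIES, timeout=30)
    timings.append(time.perf_counter() - start)
    print(f"Request {i + 1}: {timings[-1]:.4f}s (HTTP {resp.status_code})")

print(f"Avg Latency: {statistics.mean(timings):.1f}s")
```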

The results were validating and horrifying:

```text
Request 1: 2.1642s
Request 2: 8.4321s
Request 3: 5.1234s
...
Avg Latency: 4.8s
```

## Investigation & Red Herrings

### Attempt 1: The Connection Overhead Hypothesis

**The Hypothesis**: Authentik defaults to `CONN_MAX_AGE=0`, meaning it closes the database connection after every request. Since I enforce SSL for the database, I assumed the handshake overhead was killing performance.

**The Fix Attempt**: I updated the Authentik configuration to enable persistent connections:

```yaml
env:
  - name: AUTHENTIK_POSTGRESQL__CONN_MAX_AGE
    value: "600"
```

**The Reality**: The benchmark showed a slight improvement (~4.2s average), but the random 5-8s spikes remained. The 300ms connection setup was a factor, but not the root cause. As a side note, enabling this without configuring TCP Keepalives caused the Authentik worker to crash with `OperationalError('the connection is closed')` when firewalls silently dropped idle connections.
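
If you enable persistent connections, server-side TCP keepalives are one way to stop middleboxes from silently dropping idle sessions. A sketch for the Zalando operator spec; the exact values are assumptions to tune for your network:

```yaml
spec:
  postgresql:
    parameters:
      tcp_keepalives_idle: "60"      # start probing after 60s of idle time
      tcp_keepalives_interval: "10"  # re-probe every 10s
      tcp_keepalives_count: "3"      # drop the connection after 3 failed probes
```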

### Attempt 2: CPU Starvation

**The Hypothesis**: The pods were CPU throttled during request processing.

**The Reality**: `kubectl top pods` showed the server using only 29m (2.9% of a core). Even increasing the Gunicorn worker count from 2 to 4 did not improve the latency of individual requests, though it did help with concurrency.

## The Root Cause: A Perfect Storm

I was stuck. The CPU was idle, the network was fine, and individual database queries were fast (<1ms). Then I looked at the traffic patterns:

1. **Redis**: Almost zero traffic.
2. **Postgres**: High `WALSync` and `WALWrite` wait times (sampled with the query below).
3. **The Table**: `django_postgres_cache_cacheentry` was getting hammered.
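
For anyone reproducing this: the wait events came from sampling `pg_stat_activity` while the benchmark hammered the endpoint. A minimal version of that query:

```sql
-- What are active backends waiting on right now?
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;
```

Run it a few times during a slow request; WAL-related waits dominating the output is the tell.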

### Insight: The Breaking Change

I checked the release notes for **Authentik 2025.10**:

> *Breaking Change: Redis is no longer used for caching. All caching has been moved to the PostgreSQL database to simplify deployment.*

This architectural shift created a bottleneck specific to my storage backend:

1. **The Change**: Every API request triggers a cache write (session updates) to Postgres instead of Redis.
2. **The Default**: Postgres defaults to `synchronous_commit = on`. A transaction is not considered "committed" until it is flushed to disk.
3. **The Storage**: Ceph RBD replicates data across the network to multiple OSDs.

Every time I loaded the dashboard, Authentik tried to update the cache. Postgres paused, waited for the WAL flush—which on Ceph means waiting for the write to be replicated to multiple OSDs over the network—and *then* responded.

## The Solution

I couldn't move the database to local NVMe without losing the failover capabilities I built the cluster for. However, for a cache-heavy workload, I could compromise on strict durability.

I patched the Postgres configuration to disable synchronous commits:

```yaml
spec:
  postgresql:
    parameters:
      synchronous_commit: "off"  # The magic switch
```

**What this does**: Postgres returns "Success" to the application as soon as the transaction is in memory. It flushes to disk in the background. In the event of a crash, I might lose the last ~500ms of data (mostly cache entries), which is an acceptable trade-off.
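
If flipping it cluster-wide feels too blunt, the same trade-off can be scoped more narrowly: `synchronous_commit` is settable per role or per transaction. A sketch, where the role name is an assumption:

```sql
-- Apply async commits only to the application's role (name assumed):
ALTER ROLE authentik SET synchronous_commit = off;

-- Or opt out for a single transaction:
BEGIN;
SET LOCAL synchronous_commit TO off;
-- ... latency-sensitive cache writes ...
COMMIT;
```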

## Verification

I re-ran the benchmark with `synchronous_commit = off`.

| Metric | Before (`sync=on`) | After (`sync=off`) | Improvement |
|--------|-------------------|-------------------|-------------|
| Sequential (Avg) | ~4.8s | **0.40s** | **12x Faster** |
| Parallel (Wall) | ~10.5s | **2.45s** | **4x Faster** |

The latency vanished. The login became instant.

## Key Insights

* **Read Release Notes**: The shift from Redis to Postgres for caching was a major architectural change that I missed during the upgrade.
* **Storage Matters**: Distributed storage (Ceph/Longhorn) handles linear writes well, but struggles with latency-sensitive, high-frequency sync operations like WAL updates.
* **Tuning Postgres**: For workloads where immediate durability is less critical than latency (like caching tables), `synchronous_commit = off` is a powerful tool.
* **Observability**: The "Wife Test" is a valid monitoring alert. If a user complains it's slow, investigate the P99 latency, not just the average.

### References

* [Authentik 2025.10 Release Notes](https://docs.goauthentik.io/releases/2025.10/)
* [PostgreSQL Documentation: Synchronous Commit](https://www.postgresql.org/docs/current/wal-async-commit.html)