Intermittent 502/504 errors in Kubernetes (Traefik + FastAPI/Uvicorn), worse under load spikes
I have a FastAPI service running on Kubernetes behind a Traefik ingress. (I previously used Nginx as the ingress and saw the same issue.)
Setup
FastAPI (Uvicorn, ~8 workers)
Kubernetes limits: 500m CPU / 1Gi memory
Traefik ingress controller (shared cluster-wide)
PostgreSQL backend (Django ORM usage in background workloads)
Readiness probe: /healthz (1s timeout)
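For reference, here is the arithmetic implied by the setup above, as a small sketch. It assumes the CPU limit is enforced through the default 100 ms CFS quota period (the cluster may be configured differently) and treats "~8 workers" as exactly 8. The variable names are illustrative only.

```python
# Figures taken from the setup list; the worker count is approximate.
cpu_limit_millicores = 500
uvicorn_workers = 8

# Assumption: default CFS quota period of 100 ms.
cfs_period_ms = 100

# CPU share per worker if all workers are busy at the same time.
per_worker_millicores = cpu_limit_millicores / uvicorn_workers

# CPU time the whole pod may consume per CFS period before being throttled.
pod_quota_ms_per_period = cfs_period_ms * cpu_limit_millicores / 1000

print(per_worker_millicores)    # 62.5
print(pod_quota_ms_per_period)  # 50.0
```

These numbers only restate the configuration; they do not by themselves explain the errors.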
What I tried
Increased replica count → no effect
Increased CPU/memory limits → no effect
Tuned Django DB connection settings (timeouts, pooling-related settings) → no effect
Disabled Uvicorn reload → no effect
Verified the database holds very little data (roughly 5-10 rows)
Observed baseline: a small number of intermittent 5xx errors under continuous polling, even without cron activity
Errors significantly increase immediately when periodic cron-driven DB activity starts/resumes
Enabled a “bye mode” to bypass DB writes/reads during spikes → 5xx still occurred when activity resumed
Removed database connection checks from the /healthz endpoint → no effect
Observation
The system shows a low but non-zero 5xx rate even under normal conditions, and the error rate rises sharply, almost immediately, when the periodic background DB load starts. The issue appears correlated with load spikes rather than with a specific endpoint or query.
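To put numbers on that correlation, the polling results could be split into "during cron activity" and "outside cron activity" and the 5xx rate compared between the two. The following is a minimal sketch, assuming each poll is recorded as a timestamp plus an HTTP status code and that the cron windows are known as (start, end) pairs. `Sample`, `in_any_window`, and `error_rates` are hypothetical names, not part of the service.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Window = Tuple[float, float]  # (start, end) in seconds, end exclusive


@dataclass
class Sample:
    ts: float    # time of the poll, in the same units as the windows
    status: int  # HTTP status code returned through the ingress


def in_any_window(ts: float, windows: List[Window]) -> bool:
    """True if ts falls inside at least one cron window."""
    return any(start <= ts < end for start, end in windows)


def error_rates(samples: Iterable[Sample],
                cron_windows: List[Window]) -> Tuple[float, float]:
    """Return (5xx rate inside cron windows, 5xx rate outside them)."""
    # Map "inside a window?" to [error_count, total_count].
    counts = {True: [0, 0], False: [0, 0]}
    for s in samples:
        bucket = counts[in_any_window(s.ts, cron_windows)]
        bucket[1] += 1
        if 500 <= s.status <= 599:
            bucket[0] += 1

    def rate(bucket: List[int]) -> float:
        errors, total = bucket
        return errors / total if total else 0.0

    return rate(counts[True]), rate(counts[False])


# Made-up data purely to show the call shape, not real measurements.
samples = [
    Sample(0, 200), Sample(1, 502), Sample(10, 504),
    Sample(11, 502), Sample(12, 200), Sample(20, 200),
]
inside, outside = error_rates(samples, cron_windows=[(10, 15)])
print(f"inside={inside:.2f} outside={outside:.2f}")
```

With real poll logs, a large gap between the two rates would back up the load-spike correlation, while similar rates would point toward the baseline errors having a separate cause.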