Intermittent 502/504 errors in Kubernetes (Traefik + FastAPI/Uvicorn), worse under load spikes

I have a FastAPI service running on Kubernetes behind a Traefik ingress. (I previously used Nginx as the ingress, and the issue was the same.)

Setup

  • FastAPI (Uvicorn, ~8 workers)

  • Kubernetes limits: 500m CPU / 1Gi memory

  • Traefik ingress controller (shared cluster-wide)

  • PostgreSQL backend (Django ORM usage in background workloads)

  • Readiness probe: /healthz (1s timeout)
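
For context, here is a rough back-of-the-envelope calculation of what those limits mean per worker. It is only a sketch: the 100 ms CFS period is an assumption (the usual Kubernetes default), the function name cpu_budget is made up for illustration, and the worst case assumes all 8 workers are runnable on separate cores at the same moment. It does not identify the cause, since raising the limits did not help in my case.

```python
# Rough CPU budget for the setup above (500m limit, ~8 Uvicorn workers).
# Assumption: CFS period of 100 ms (the usual default); adjust if your
# kubelet is configured differently.

CFS_PERIOD_MS = 100  # assumed default, not taken from my cluster config


def cpu_budget(limit_millicores, workers, period_ms=CFS_PERIOD_MS):
    """Return simple derived figures for a CPU limit shared by N workers."""
    # CPU time the whole container may use in one scheduling period.
    quota_ms = period_ms * limit_millicores / 1000
    # Average share if the workers split the limit evenly.
    per_worker_millicores = limit_millicores / workers
    # Worst case: every worker runs on its own core at once, so the quota
    # is burned `workers` times faster than wall-clock time.
    worst_case_burst_ms = quota_ms / workers
    return {
        "quota_ms_per_period": quota_ms,
        "per_worker_millicores": per_worker_millicores,
        "worst_case_burst_ms": worst_case_burst_ms,
        "worst_case_throttled_ms": period_ms - worst_case_burst_ms,
    }


budget = cpu_budget(limit_millicores=500, workers=8)
for name, value in budget.items():
    print(f"{name}: {value}")
```

With these numbers, each worker averages 62.5m of CPU, and in the worst case the container could use up its 50 ms quota in about 6 ms of wall-clock time and then sit throttled for the rest of the period.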

What I tried

  • Increased replica count → no effect

  • Increased CPU/memory limits → no effect

  • Tuned Django DB connection settings (timeouts, pooling-related settings) → no effect

  • Disabled Uvicorn reload → no effect

  • Verified that the database holds very little data (roughly 5-10 rows)

  • Observed baseline: a small number of intermittent 5xx errors under continuous polling, even with no cron activity

  • Errors increase sharply as soon as the periodic cron-driven DB activity starts or resumes

  • Enabled a “bye mode” to bypass DB writes/reads during spikes → 5xx still occurred when activity resumed

  • Removed database connection checks from /healthz endpoint → no effect
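
The continuous polling mentioned above can be sketched roughly like this, using only the standard library. This is an illustrative sketch, not the exact script I ran: the function names, the URL in the usage comment, the request count, and the interval are all placeholders.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code, or 0 if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as HTTPError; keep the code.
        return exc.code
    except OSError:
        # Connection refused, reset, DNS failure, or timeout.
        return 0


def summarize(statuses):
    """Count responses per status code and compute the 5xx rate."""
    counts = Counter(statuses)
    total = len(statuses)
    errors = sum(n for code, n in counts.items() if 500 <= code <= 599)
    return {
        "total": total,
        "5xx": errors,
        "rate": errors / total if total else 0.0,
        "by_status": dict(counts),
    }


def poll(url, count=100, interval_s=0.2, fetch=fetch_status):
    """Request `url` `count` times, pausing `interval_s` between requests."""
    statuses = []
    for _ in range(count):
        statuses.append(fetch(url))
        time.sleep(interval_s)
    return summarize(statuses)


# Example usage (placeholder URL, point it at the ingress instead):
# print(poll("http://localhost:8000/healthz", count=200, interval_s=0.1))
```

Running it once while the cron jobs are idle and once while they are active gives two 5xx rates to compare; the by_status breakdown also separates 502 from 504.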

Observation

The system shows a low but non-zero 5xx rate even under normal conditions, and that rate rises sharply, almost immediately, whenever the periodic background DB load kicks in. The errors appear to correlate with load spikes rather than with any specific endpoint or query.
