Django + Gunicorn API intermittently hangs (504 errors) — workers appear idle but requests never complete

I am running a Django application inside Docker using Gunicorn with multiple workers and threads. The system also includes PostgreSQL, Redis, and multiple Celery workers.

The issue appears after the service has been running continuously for ~4 days.

Problem

After around 4 days of uptime:

- All frontend API requests start failing with 504 Gateway Timeout
- Django admin panel continues to work normally
- Gunicorn workers are still running (no crashes)
- No container restarts occur
- CPU usage is low and workers appear idle (ep_poll, do_select)
- Requests never complete (no response returned)

Observations

- DB connections remain healthy: 5–7 idle, 1 active
- Gunicorn processes remain alive: no worker crashes observed, no restart loops
- Internal health check logs show response code 000 (no HTTP response received)

Gunicorn config

gunicorn alloy_admin.wsgi:application \
  --workers 3 \
  --threads 2 \
  --timeout 120 \
  --max-requests 1000 \
  --max-requests-jitter 100 \
  --keep-alive 5

Question

What could cause a Django + Gunicorn setup to work normally for several days and then gradually reach a state where all API requests hang and eventually return 504, while workers remain alive and idle?

Is this typically caused by:

- Thread exhaustion / worker saturation?
- Blocking synchronous external API calls inside views?
- Connection pool exhaustion (DB / HTTP)?
- Nginx / proxy timeout issues?
- Long-lived memory/state issues in Gunicorn workers?

What is the recommended way to debug this kind of “delayed global hang” where the system degrades over time without crashes?

Workers staying alive with low CPU while requests pile up at the proxy usually means one of two things:

- Requests are no longer reliably reaching the Gunicorn workers (backlog, FD, or socket exhaustion somewhere between the proxy and Gunicorn).
- Workers are blocked waiting on external I/O (HTTP calls, Redis, DNS, upstream services, etc.).

The healthy DB pool and workers sitting in epoll/select make a DB bottleneck or CPU saturation less likely.

The admin panel still working doesn't necessarily rule either out. Admin traffic is usually much lighter and may avoid whatever code path or dependency your API endpoints are hitting.

When the issue happens again, I'd check what the workers are actually waiting on:

strace -fp <worker-pid>

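If every thread of a worker is consistently sitting in epoll_wait (or in futex, for idle pool threads) with no activity, requests may not be reaching it at all. If some threads are blocked in recv, send, connect, or similar syscalls on an outbound socket, then the worker is probably waiting on some external dependency. DNS lookups are not a syscall of their own; they typically show up as poll/recvfrom on a UDP socket to port 53. Because you run with --threads 2, keep in mind that the worker's main thread sits in a poll loop even when its request threads are stuck, so look at all the threads that -f attaches to, not only the first one.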
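strace shows the syscall a thread is blocked in, but not which line of Python got it there. A complementary option is the standard library's faulthandler module, which can dump every thread's Python stack when the process receives a signal. The following is a minimal sketch, not something from your setup: the helper name install_stack_dump_handler and the choice of SIGUSR2 are my assumptions (pick a signal your Gunicorn version does not use inside workers), and wiring it in through Gunicorn's post_worker_init hook in a gunicorn.conf.py is one possible placement.

```python
import faulthandler
import signal
import sys


def install_stack_dump_handler(signum=signal.SIGUSR2, file=sys.stderr):
    """Dump the Python stack of every thread to `file` when `signum` arrives.

    `file` must be backed by a real file descriptor, because faulthandler
    writes to it directly from the signal handler.
    """
    faulthandler.register(signum, file=file, all_threads=True)


# Hypothetical wiring in gunicorn.conf.py: post_worker_init runs inside each
# worker after it has loaded the application, so the handler is installed
# per worker process rather than in the master.
def post_worker_init(worker):
    install_stack_dump_handler()
```

During a hang, `kill -USR2 <worker-pid>` should then write the stacks to the worker's stderr (the container logs, in a typical Docker setup). A stack ending inside an HTTP client, Redis client, or socket call points at the dependency that is not answering.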

Also inspect socket state between the proxy and Gunicorn:

ss -tanp | grep gunicorn
lsof -p <gunicorn-pid>

Look for large numbers of:

- SYN_RECV (printed by ss as SYN-RECV)
- CLOSE_WAIT (printed by ss as CLOSE-WAIT)
- long-lived established connections
- unusually high FD counts
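To compare a healthy moment against the hang, it helps to reduce the ss output to a count per TCP state. Here is a small sketch that does so; the function name count_tcp_states and the sample rows are made up for illustration, and it assumes the default `ss -tan` layout where the first column is the state.

```python
from collections import Counter


def count_tcp_states(ss_output):
    """Count sockets per TCP state from the text printed by `ss -tan`."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row ("State Recv-Q Send-Q ...").
        if not fields or fields[0] == "State":
            continue
        counts[fields[0]] += 1
    return counts


# Illustrative sample only; addresses and numbers are invented.
sample = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      2048   0.0.0.0:8000        0.0.0.0:*
ESTAB      0      0      172.18.0.5:8000     172.18.0.2:51234
CLOSE-WAIT 1      0      172.18.0.5:8000     172.18.0.2:51240
CLOSE-WAIT 1      0      172.18.0.5:8000     172.18.0.2:51244
"""

if __name__ == "__main__":
    for state, n in count_tcp_states(sample).most_common():
        print(state, n)
```

In practice you would feed it the real output, for example via `subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout`. A CLOSE-WAIT count that keeps growing over the days suggests the application is not closing sockets that the peer has already closed. On Linux, the LISTEN row is also worth a look: for a listening socket, ss reports the current accept-queue length in Recv-Q and its limit in Send-Q, so a Recv-Q close to Send-Q suggests the backlog is full.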

Another useful signal: when the system is hung, do requests still appear in Gunicorn access logs? If nginx/proxy logs requests but Gunicorn never sees them, that points more toward connection backlog/resource exhaustion before the workers.

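If you have synchronous external calls inside views, I'd inspect those too, especially any made without an explicit timeout. With --workers 3 --threads 2 there are only six request slots, so six stuck outbound calls are enough to make the whole API appear dead while the processes themselves stay healthy. As far as I know, --timeout 120 does not rescue you here: with threads enabled, the worker's main loop keeps sending heartbeats to the master, so a worker whose request threads are all blocked is not killed.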
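The usual protection against a stuck outbound call is an explicit timeout on every one of them. I don't know which HTTP client your views use, so this sketch uses only the standard library; the helper name fetch_with_timeout and the 5-second default are illustrative assumptions. The same idea applies to whatever client you have (for example the timeout argument in requests, or the socket timeout options of your Redis client).

```python
import urllib.request

DEFAULT_TIMEOUT = 5.0  # seconds; illustrative value, tune per dependency


def fetch_with_timeout(url, timeout=DEFAULT_TIMEOUT):
    """Return the response body, or None on a timeout or network/HTTP error.

    The timeout applies to each blocking socket operation (connect, read),
    not to the request as a whole, so it bounds a silent upstream but not
    one that trickles data slowly.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return None
```

A view that gets None back can return an error quickly and free its thread, instead of holding one of the six slots until the upstream recovers.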

I'd start with strace during the hang. That usually narrows it down very quickly.
