Django + Gunicorn API intermittently hangs (504 errors) — workers appear idle but requests never complete

I am running a Django application inside Docker using Gunicorn with multiple workers and threads. The system also includes PostgreSQL, Redis, and multiple Celery workers.

The issue appears after the service has been running continuously for ~4 days.

Problem

After around 4 days of uptime:

  • All frontend API requests start failing with 504 Gateway Timeout
  • Django admin panel continues to work normally
  • Gunicorn workers are still running (no crashes)
  • No container restarts occur
  • CPU usage is low and workers appear idle (ep_poll, do_select; see the check after this list)
  • Requests never complete (no response returned)
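
For reference, the idle state was checked roughly like this (the container name app is a placeholder and ps has to be available in the image); each Gunicorn process shows a kernel wait channel of ep_poll or do_select:

# wchan shows the kernel function each process is currently blocked in
docker exec app ps -eo pid,wchan:30,args | grep gunicorn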

Observations

  • DB connections remain healthy (see the checks after this list):

    • idle: 5–7
    • active: 1
  • Gunicorn processes remain alive:

    • no worker crashes observed
    • no restart loops
  • Internal health check logs show (reproduced after this list):

    • Response code: 000 (no HTTP response received)
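
For reference, the connection counts and the health-check result above come from roughly the following checks; the database name, credentials, and health endpoint are placeholders, and 000 is what curl prints when no HTTP response arrives before its timeout:

# connection states as seen by Postgres
psql -U postgres -d app_db -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# health check: prints 000 when no response is received within 10 seconds
curl -s -o /dev/null -w '%{http_code}\n' --max-time 10 http://127.0.0.1:8000/api/health/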

Gunicorn config

gunicorn alloy_admin.wsgi:application \
  --workers 3 \
  --threads 2 \
  --timeout 120 \
  --max-requests 1000 \
  --max-requests-jitter 100 \
  --keep-alive 5
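
For context: with --workers 3 and --threads 2 (which selects the gthread worker class), this instance can serve at most 3 × 2 = 6 requests concurrently, so six requests that block indefinitely would be enough to make every subsequent API call queue up and eventually time out.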

Question

What could cause a Django + Gunicorn setup to work normally for several days and then gradually reach a state where all API requests hang and eventually return 504, while workers remain alive and idle?

Is this typically caused by:

  • Thread exhaustion / worker saturation?
  • Blocking synchronous external API calls inside views?
  • Connection pool exhaustion (DB / HTTP)?
  • Nginx / proxy timeout issues?
  • Or long-lived memory/state issues in Gunicorn workers?

What is the recommended way to debug this kind of “delayed global hang” where the system degrades over time without crashes?
