Why are we seeing incorrect timings in Datadog APM tracing for simple actions like a Redis GET?
We have a Python Django application deployed to AWS EKS on EC2 instances. We use Gunicorn as our web server, although we recently ran Apache + wsgi and saw the same issues. The EC2 instances are m6a.xlarge, and the containers themselves have 1500m CPU and 2048 MB of memory.
We use Datadog APM tracing and we're tracking Redis calls. Occasionally we'll see a slow request, and when we look deeper we'll see in the trace a simple Redis GET query take 2.8 seconds!
I've used Redis for years and I've never had a problem with it. It's lightning fast. And our platform produces very little load (CPU is around 2% on a small instance). I enabled slow logging in ElastiCache and we're seeing no entries, so I know it isn't the Redis instance itself.
What could be causing this?
I know the timings in APM are based on ddtrace's own in-Python timings, so maybe our Python processes are getting bogged down somehow? CPU usage on the EC2 instances in general is very low, 10% at most. Memory usage is also low. We don't have overallocation, so I don't think it's that. I'm at a loss.
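
To test the "bogged down process" theory, one option is to compare wall-clock time with CPU time around the suspect call. This is only a rough sketch, not something taken from ddtrace: `wall_vs_cpu` is a hypothetical helper, and the `time.sleep` call is a stand-in for the real Redis GET. If wall time is large while CPU time stays near zero, the thread was waiting (on the network, or because it was not scheduled) rather than doing work in Python.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_vs_cpu(label, results):
    """Record wall-clock and per-thread CPU time spent inside the block.

    Hypothetical helper for illustration; not part of ddtrace or redis-py.
    """
    wall_start = time.perf_counter()   # monotonic wall-clock timer
    cpu_start = time.thread_time()     # CPU time used by this thread only
    try:
        yield
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.thread_time() - cpu_start
        results[label] = (wall, cpu)


timings = {}
with wall_vs_cpu("fake_redis_get", timings):
    # Stand-in for the real call, e.g. client.get("some-key")
    time.sleep(0.05)

wall, cpu = timings["fake_redis_get"]
print(f"wall={wall:.3f}s cpu={cpu:.3f}s waiting={wall - cpu:.3f}s")
```

Logging these two numbers for the slow requests would at least show whether the 2.8 seconds is spent waiting or computing, though it cannot by itself tell network latency apart from the process being paused.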