Large file download fails on slow network using HTTP/2 and AWS ALB
Given the following architecture:
```
client <-> AWS ALB <-> uwsgi <-> Django
```
The client fails to download a 12MB file when using HTTP/2 (the default), but works using HTTP/1.1. The file is streamed through Django for authentication purposes (it's fetched from a third party service).
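The proxy-through-Django part looks roughly like the pattern below. This is a minimal sketch, not the actual view: `iter_chunks`, the chunk size, and the in-memory stand-in are all illustrative, and in the real view the generator would be wrapped in Django's `StreamingHttpResponse` after the auth check, with `source` being the third-party response body.

```python
import io

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size


def iter_chunks(source, chunk_size=CHUNK_SIZE):
    """Yield the upstream body in fixed-size chunks.

    In the real Django view this generator would be passed to
    StreamingHttpResponse once authentication has succeeded; the
    names here are made up for illustration.
    """
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Example with an in-memory stand-in for the upstream response body:
body = io.BytesIO(b"x" * (3 * CHUNK_SIZE + 123))
chunks = list(iter_chunks(body))
assert b"".join(chunks) == b"x" * (3 * CHUNK_SIZE + 123)
```

The relevant point for this question is that Django (behind uwsgi) writes the body out as fast as the downstream will accept it, so the slow client determines how long the connection stays busy.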
Here is an example of failure (I'm using a Socks proxy to limit bandwidth):
```
$ curl -x socks5://localhost:1080 https://example.com/file.pdf --output "file.pdf"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 79 12.6M   79 10.0M    0     0  61254      0  0:03:36  0:02:51  0:00:45 64422
curl: (92) HTTP/2 stream 1 was not closed cleanly: PROTOCOL_ERROR (err 1)
```
However, the same command works fine with the `--http1.1` flag.
This fails when I limit the download speed to 512kbps but works at 1024kbps. I haven't looked for the exact threshold; it's not important.
Notes:
- This also fails with a browser; it's not a `curl` issue.
- Using `curl` with `-v` doesn't give any additional information.
- uwsgi does not output any errors. As far as it's concerned, it did its job. This is the output:

  ```
  [pid: 44|app: 0|req: 2603/9176] [ip] () {34 vars in 639 bytes} [Wed Oct 16 09:29:29 2024] GET /file.pdf => generated 13243923 bytes in 2425 msecs (HTTP/1.1 200) 8 headers in 314 bytes (103471 switches on core 0)
  ```
- Similarly, there are no issues listed in the ALB logs. It logs the request as successful, but with a number of bytes lower than the expected amount.
I'd like to understand why it fails with HTTP/2 on a slow network. I suspect it has something to do with the ALB.
There are two obvious workarounds people might suggest:
- Put Nginx between the ALB and uwsgi
- Don't stream the file through Django
Both of these are valid suggestions, however I'd like to understand what the problem is before deciding on a solution.
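For completeness, the first option would look something like the fragment below. This is a hypothetical sketch: the location, upstream address, and sizes are placeholders, and the point is only that nginx's response buffering would decouple uwsgi from the slow client.

```
# Hypothetical nginx fragment between the ALB and uwsgi.
location /file.pdf {
    proxy_pass http://127.0.0.1:8000;   # uwsgi/Django upstream (placeholder)
    proxy_buffering on;                  # read the upstream as fast as possible
    proxy_max_temp_file_size 1024m;      # allow spooling large bodies to disk
}
```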
So I have found three things that fix the issue:
- Ask the client to use HTTP/1.1
- Set the ALB to use HTTP/1.1
- Increase the idle timeout of the ALB to 120 seconds
That last point (which is the solution I'm going to go for) I only tried as a last resort: because the same download succeeded with HTTP/1.1 (same number of bytes, same network conditions), I didn't expect timeouts or buffering to be the issue.
However, it seems the ALB handles idle timeouts more aggressively with HTTP/2 than it does with HTTP/1.1.
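For reference, the idle timeout can be raised with the AWS CLI; the load balancer ARN below is a placeholder.

```shell
# Raise the ALB idle timeout from the default 60s to 120s.
# Replace the ARN with your load balancer's actual ARN.
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188 \
  --attributes Key=idle_timeout.timeout_seconds,Value=120
```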
If I had to hazard a guess:
- uwsgi/nginx send all the bytes to the ALB, which buffers them [*]
- Another device (the NLB, proxies, routers, firewalls, etc.) between the ALB and client buffers some of the data - but not all of it
- The client reads from that intermediary device. By the time the client has consumed enough data that the intermediary needs to fetch more from the ALB, the 60-second idle timeout has elapsed and the ALB has closed the connection
- This happens with HTTP/2 and not HTTP/1.1, possibly because the ALB is more aggressive about closing streams with HTTP/2, since the same connection is shared by multiple downloads (total guess there)
I'm answering my own question, but won't mark this as the accepted answer, as these are just guesses. If someone has some definite answers, I'd still like to hear them.
[*] The internet isn't clear on whether ALBs buffer responses: some people say they do, but since the ALB is closed source and buffering isn't specified in the documentation, we don't have a definitive answer.