What strategies can I use to handle occasionally long-running blocking database tasks in Django?

I have a scientific data submission validation interface on my website. It essentially does a dry run load of the data and reports whether the dry run load succeeds and if it does not, it gives a report on errors and warnings that occurred during the dry run load.

All of this is done inside an atomic transaction. So if two users submit data to validate at the same time, one of the processes will be blocked until the first one finishes.
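To make the pattern concrete, here is a minimal stand-in for the dry-run idea, using sqlite3 so it runs standalone (the real code uses Django's `transaction.atomic` and real models; the table and function names here are purely illustrative):

```python
import sqlite3

def dry_run_validate(rows):
    """Attempt the full load inside a transaction, then roll it back.

    Returns a list of error messages; an empty list means the dry run
    load succeeded. (Illustrative stand-in for the real Django loader.)
    """
    errors = []
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (name TEXT NOT NULL UNIQUE, value REAL)")
    try:
        for name, value in rows:
            try:
                conn.execute(
                    "INSERT INTO samples (name, value) VALUES (?, ?)",
                    (name, value),
                )
            except sqlite3.IntegrityError as exc:
                errors.append(f"row {name!r}: {exc}")
    finally:
        conn.rollback()  # the dry-run work is always thrown away
        conn.close()
    return errors

print(dry_run_validate([("a", 1.0), ("a", 2.0)]))  # duplicate name -> one error
```

The point is just that every insert happens for real inside the transaction (so constraint violations surface as errors), and then the whole thing is rolled back.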

Most such tasks take only a handful of seconds, so it's not a big deal, but occasionally a user will submit a large amount of data for validation, and in the worst case all of the processing can take just shy of a minute, in which case any other concurrent validation submission will hit a timeout.

I am currently considering my options for mitigating or solving this issue.

Ideally, there would be some way for me to tell the database that none of this work will ever be committed, so it can ignore any other concurrent database operations, because all of the work will get thrown away anyway. I'm sure that this capability does not exist, but if I were to envision how it would work, I would pass some argument to the atomic decorator that says "ignore database locks, because I promise not to commit".

Since that is likely a fantasy, and I cannot achieve true concurrency here, I figure the best solution is to use something like celery, so that jobs can run in the background and avoid a timeout.
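What I'm imagining with celery is routing all validation jobs to a dedicated queue served by a single worker, so they execute serially instead of piling up on database locks. A rough configuration sketch (module, queue, and broker names are all illustrative, and `run_dry_run_load` is a placeholder for my real loader):

```python
# celery_app.py -- sketch only
from celery import Celery

app = Celery("site", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

# Route every validation task to its own queue...
app.conf.task_routes = {"celery_app.validate_submission": {"queue": "validation"}}

@app.task(bind=True)
def validate_submission(self, data):
    self.update_state(state="PROGRESS", meta={"step": "loading"})
    return run_dry_run_load(data)  # placeholder for the real dry-run loader

# ...and run exactly one worker process for that queue, so jobs are serial:
#   celery -A celery_app worker -Q validation --concurrency=1
```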

But what I would like to do is be transparent with the user. I could implement some sort of progress bar to show them that their job is proceeding, and I can give little updates during the process, but I would also like to be transparent about their job waiting for other jobs to release their locks on models/rows of the database. That is, I would like to tell them that they are waiting in the queue and that there are, for example, two users in front of them.
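In other words, what I want to be able to compute is something like this (here with a hypothetical in-memory FIFO standing in for whatever the real job store would be, whether a database table, a Redis list, or celery's own queue):

```python
from collections import deque

# Hypothetical stand-in for the real job store.
validation_queue = deque()

def enqueue(job_id):
    validation_queue.append(job_id)

def position_in_queue(job_id):
    """Number of jobs ahead of job_id; 0 means it runs next."""
    return validation_queue.index(job_id)

enqueue("alice")
enqueue("bob")
enqueue("carol")
print(position_in_queue("carol"))  # -> 2: two users are ahead of carol
```

The hard part is that in my current setup there is no such explicit queue: the "queue" is implicit in whoever is blocked on the database locks.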

So my question is: is it possible, in a database-architecture-agnostic way, to find out how many validation jobs are queued ahead of any one job submission?

My database is postgres, and I have figured out how to determine which database operations are currently blocked, but from the perspective of one particular job, I can't tell what the wait order is. I also don't know how to write such a query so that it would still work if we swapped in a different database, regardless of its architecture.
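For reference, the postgres-side query I have is roughly of this shape (obviously not database-agnostic; it relies on `pg_blocking_pids`, available in PostgreSQL 9.6+):

```sql
-- Sessions that are currently waiting on a lock, with the PIDs blocking them.
SELECT w.pid                   AS waiting_pid,
       w.query                 AS waiting_query,
       pg_blocking_pids(w.pid) AS blocked_by
FROM pg_stat_activity AS w
WHERE cardinality(pg_blocking_pids(w.pid)) > 0;
```

This tells me who is blocked by whom right now, but not the order in which the waiters will be granted the lock.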

My guess is that this is not possible either, at least not in a database-agnostic way. I'm not very familiar with celery. I've played with it before and gotten it to work in a test case for simple progress bars, but that was a long time ago. And since it is designed for running processes concurrently, it seems like a misapplication for jobs like these that essentially must run serially.

Has anyone dealt with a problem like this before? What are my options?

Side note: validation jobs can also be blocked by data loading jobs that are run on the backend by the curators who process data submissions. Those loading jobs take much longer than validation, sometimes in excess of an hour. There are definitely some efficiency/optimization tasks that could mitigate all of this, but the fact remains that we are processing lots of data and it's just not going to be instantaneous. So my plan for handling loading is to schedule downtimes during which those loads occur.
