Filtering Django QuerySet after using a Window function
For a simple model:
from django.db.models import CharField, Model, PositiveIntegerField
class Example(Model):
category = CharField(max_length=20, null=False, blank=False)
version = PositiveIntegerField(null=False, blank=False)
class Meta:
unique_together = ["category", "version"]
And some sample data:
Example.objects.update_or_create(id=1, category="Thing 1", version=1)
Example.objects.update_or_create(id=2, category="Thing 1", version=2)
Example.objects.update_or_create(id=3, category="Thing 1", version=4)
Example.objects.update_or_create(id=4, category="Thing 2", version=1)
Example.objects.update_or_create(id=5, category="Thing 2", version=2)
Example.objects.update_or_create(id=6, category="Thing 3", version=3)
I would like to use window functions to get:
- The examples that, only from the sub-set with the latest version in each category, have ids either
1,4,5or6.
I am trying to do this using the RowNumber window function:
from django.db.models import F, Window
from django.db.models.functions.window import RowNumber
results = Example.objects.alias(
rn=Window(
expression=RowNumber(),
partition_by=[F("category")],
order_by="-version"
),
).filter(rn=1).filter(id__in=[1,4,5,6])
print(results.query)
print(results)
This generates the query:
SELECT "col1",
"col2",
"col3"
FROM (
SELECT *
FROM (
SELECT "greatest_n_example"."id" AS "col1",
"greatest_n_example"."category" AS "col2",
"greatest_n_example"."version" AS "col3",
ROW_NUMBER() OVER (PARTITION BY "greatest_n_example"."category" ORDER BY "greatest_n_example"."version" DESC) AS "qual0"
FROM "greatest_n_example"
WHERE "greatest_n_example"."id" IN (1, 4, 5, 6)
) "qualify"
WHERE "qual0" = 1
) "qualify_mask"
And returns examples 1, 5 and 6; which is not what is required.
The latest for category="Thing 1" is version=4 and the filter on rn should be applied before the filter on id (it isn't, it is currently the other way round) so it is expected that all the Thing 1 category items should be excluded from the result set.
The expected query is equivalent to:
SELECT "col1",
"col2",
"col3"
FROM (
SELECT *
FROM (
SELECT "greatest_n_example"."id" AS "col1",
"greatest_n_example"."category" AS "col2",
"greatest_n_example"."version" AS "col3",
ROW_NUMBER() OVER (PARTITION BY "greatest_n_example"."category" ORDER BY "greatest_n_example"."version" DESC) AS "qual0"
FROM "greatest_n_example"
) "qualify"
WHERE "qual0" = 1
AND "col1" IN (1, 4, 5, 6)
) "qualify_mask"
Which would output just 5 and 6.
How, using window functions, can I enforce that the row-number filter needs to be happen, at least, within the same inline view of the SQL as the id filter is applied, if not before?
Note: I know I can solve this problem using Subquery, rather using than Window, to perform the filtering using correlated sub-queries; this question is specifically about how to use window functions and to get the WHERE filters in the correct inline-view of the SQL query.
You can solve it using Case and When:
from django.db.models import Case, F, When, Window
from django.db.models.functions.window import RowNumber
from greatest_n.models import Example
query = Example.objects.alias(
rn=Window(
expression=RowNumber(),
partition_by=[F("category")],
order_by="-version"
),
first_id=Case(When(rn=1, then=F("id")), default=None),
).filter(first_id__in=[1,4,5,6])
print(query.query)
print(query)
Which generates the query:
SELECT "col1",
"col2",
"col3"
FROM (
SELECT *
FROM (
SELECT "greatest_n_example"."id" AS "col1",
"greatest_n_example"."category" AS "col2",
"greatest_n_example"."version" AS "col3",
CASE
WHEN (ROW_NUMBER() OVER (
PARTITION BY "greatest_n_example"."category"
ORDER BY "greatest_n_example"."version" DESC
) = 1)
THEN "greatest_n_example"."id"
ELSE NULL
END AS "qual0"
FROM "greatest_n_example"
) "qualify"
WHERE "qual0" IN (1, 4, 5, 6)
) "qualify_mask"
And only outputs Examples 5 and 6.
You can use a raw query:
params = [1,4,5,6]
query = Example.objects.raw(
f"""
SELECT "id",
"category",
"version"
FROM (
SELECT g."id",
g."category",
g."version",
ROW_NUMBER() OVER (
PARTITION BY g."category"
ORDER BY g."version" DESC
) AS rn
FROM "{Example._meta.db_table}" g
)
WHERE rn = 1
AND "id" IN ({",".join("%s" for _ in params)})
""",
params
)
print(query.query)
print([r for r in query])
Note: May need to handle the special case when params is empty.
Which generates the query:
SELECT "id",
"category",
"version"
FROM (
SELECT g."id",
g."category",
g."version",
ROW_NUMBER() OVER (
PARTITION BY g."category"
ORDER BY g."version" DESC
) AS rn
FROM "greatest_n_example" g
)
WHERE rn = 1
AND "id" IN (1,4,5,6)
And only outputs Examples 5 and 6.