Why do filters affect the result of StringAgg?
I'm using StringAgg
and order
as follows:
# Get order column & annotate with list of credits
if request.POST.get('order[0][name]'):
order = request.POST['order[0][name]']
if order == 'credits_primary':
releases = releases.annotate(credits_primary=StringAgg(
'credits__entity__name',
delimiter=', ',
filter=Q(credits__type='primary'),
ordering='credits__id'
))
elif order == 'credits_secondary':
releases = releases.annotate(credits_secondary=StringAgg(
'credits__entity__name',
delimiter=', ',
filter=Q(credits__type='secondary'),
ordering='credits__id'
))
else:
order = 'title'
# Order releases
if request.POST.get('order[0][dir]') == 'desc':
releases = releases.order_by(F(order).desc(nulls_last=True), 'title')
else:
releases = releases.order_by(F(order).asc(nulls_last=True), 'title')
for release in releases:
try: print(release.credits_primary)
except: pass
try: print(release.credits_secondary)
except: pass
This in itself works exactly as expected: the ordering is what I expect, and print
returns the values I expect.
However, when I apply filter
s before this, it starts behaving strangely. Namely, sometimes it's fine and still works as expected, sometimes each credits__entity__name
is repeated a random number of times, sometimes the annotation just returns None
even though there are values. I can't figure out a pattern here.
Below are the filters I'm applying, note that exclude
as far as I can tell does not cause this problem:
if request.POST.get('entity'):
releases = Release.objects.filter(credits__entity=request.POST['entity'])
else:
releases = Release.objects.all()
records_total = releases.count()
# Filter by type
if request.POST.get('type'):
query = Q()
for type in loads(request.POST['type']):
if type in Release.TYPES_P_TO_S:
query.add(Q(type=Release.TYPES_P_TO_S[type]), Q.OR)
releases = releases.filter(query)
# Filter by platform
if request.POST.get('platform'): releases = releases.filter(platforms__in=loads(request.POST['platform']))
# Filter by format
if request.POST.get('format'): releases = releases.filter(formats__in=loads(request.POST['format']))
# Filter by From/Until year
if request.POST.get('from'): releases = releases.filter(date__year__gte=loads(request.POST['from'])[0])
if request.POST.get('until'): releases = releases.filter(date__year__lte=loads(request.POST['until'])[0])
# Exclude parent types
if request.POST.get('exclude'):
query = Q()
for type in loads(request.POST['exclude']):
if type in Release.PARENT_TYPES_P_TO_S:
query.add(Q(parent_type=Release.PARENT_TYPES_P_TO_S[type]), Q.OR)
releases = releases.exclude(query)
# Filter by Search
if request.POST.get('search[value]'):
query = Q()
search = request.POST['search[value]']
query.add(Q(title__icontains=search), Q.OR)
query.add(Q(tags__tag__name__icontains=search), Q.OR)
query.add(Q(credits__entity__name__icontains=search), Q.OR)
releases = releases.filter(query)
# Make sure items aren't repeated
releases = releases.distinct()
Not sure I understand your use case, but if the problem is that you get duplicates in your aggregation, maybe you can try to use distinct
on your string aggregations:
https://docs.djangoproject.com/fr/5.1/ref/contrib/postgres/aggregates/#stringagg
Below is the filter I'm applying that affects the aggregation (applied before StringAgg):
The filter looks at (other) one-to-many or many-to-many relations. This means that this will result in LEFT OUTER JOIN
that will start to act as a multiplier. The same works with other aggregates such as SUM
and COUNT
.
Indeed, if you just aggregate, the query looks like:
SELECT string_agg(CASE
WHEN c.type = 'primary' THEN e.name
ELSE NULL
END
ORDER BY e.id)
FROM releases AS r
LEFT OUTER JOIN credits AS c ON c.release_id = r.id
LEFT OUTER JOIN entity AS e ON e.credit_id = e.id
and that is fine, because we want to aggregate over the entities, so per release we get all entities.
But now if we for example filter on the tags, it looks like:
SELECT string_agg(CASE
WHEN c.type = 'primary' THEN e.name
ELSE NULL
END
ORDER BY e.id)
FROM releases AS r
LEFT OUTER JOIN credits AS c ON c.release_id = r.id
LEFT OUTER JOIN entity AS e ON e.credit_id = e.id
LEFT OUTER JOIN release_tag rt1 ON rt1.release_id = r.id
LEFT OUTER JOIN tag t1 ON rt1.tag_id = t1.id
LEFT OUTER JOIN release_tag rt2 ON rt2.release_id = r.id
LEFT OUTER JOIN tag t1 ON rt2.tag_id = t2.id
WHERE t1.name = 'tag1'
OR t2.name = 'tag2'
This means that the entity names for a release where one of the two tags match will be repeated once, but where both match, it will be included twice.
The Q(credits__entity__name__icontains)
makes it even worse: since this will join on the table c
, and thus entities of credits
that have two or more matches will be included that many times, whereas the entities of a release that does not match with a given credit, will no longer be included.
The least common divider of all these phenomena is probably that you should be very careful to make aggregates when you make JOIN
s especially if multiple tables on which the JOINs are caried our are involved.
What you probably can do is make a Subquery
expression [Django-doc] which is thus not impacted by the "outer" tables and the corresponding JOIN
.