Why do filters affect the result of StringAgg?

I'm using StringAgg and order as follows:

    # Get order column & annotate with list of credits
    if request.POST.get('order[0][name]'):
        order = request.POST['order[0][name]']
        if order == 'credits_primary':
            releases = releases.annotate(credits_primary=StringAgg(
                'credits__entity__name',
                delimiter=', ',
                filter=Q(credits__type='primary'),
                ordering='credits__id'
            ))
        elif order == 'credits_secondary':
            releases = releases.annotate(credits_secondary=StringAgg(
                'credits__entity__name',
                delimiter=', ',
                filter=Q(credits__type='secondary'),
                ordering='credits__id'
            ))
    else:
        order = 'title'
    
    # Order releases
    if request.POST.get('order[0][dir]') == 'desc':
        releases = releases.order_by(F(order).desc(nulls_last=True), 'title')
    else:
        releases = releases.order_by(F(order).asc(nulls_last=True), 'title')
    
    for release in releases:
        try: print(release.credits_primary)
        except: pass
        try: print(release.credits_secondary)
        except: pass

This in itself works exactly as expected: the ordering is what I expect, and print returns the values I expect.

However, when I apply filters before this, it starts behaving strangely. Namely, sometimes it's fine and still works as expected, sometimes each credits__entity__name is repeated a random number of times, sometimes the annotation just returns None even though there are values. I can't figure out a pattern here.

Below are the filters I'm applying, note that exclude as far as I can tell does not cause this problem:

    if request.POST.get('entity'):
        releases = Release.objects.filter(credits__entity=request.POST['entity'])
    else:
        releases = Release.objects.all()
        
    records_total = releases.count()
        
    # Filter by type
    if request.POST.get('type'):
        query = Q()
        for type in loads(request.POST['type']):
            if type in Release.TYPES_P_TO_S:
                query.add(Q(type=Release.TYPES_P_TO_S[type]), Q.OR)
        releases = releases.filter(query)
        
    # Filter by platform
    if request.POST.get('platform'): releases = releases.filter(platforms__in=loads(request.POST['platform']))
        
    # Filter by format
    if request.POST.get('format'): releases = releases.filter(formats__in=loads(request.POST['format']))
        
    # Filter by From/Until year
    if request.POST.get('from'): releases = releases.filter(date__year__gte=loads(request.POST['from'])[0])
    if request.POST.get('until'): releases = releases.filter(date__year__lte=loads(request.POST['until'])[0])
        
    # Exclude parent types
    if request.POST.get('exclude'):
        query = Q()
        for type in loads(request.POST['exclude']):
            if type in Release.PARENT_TYPES_P_TO_S:
                query.add(Q(parent_type=Release.PARENT_TYPES_P_TO_S[type]), Q.OR)
        releases = releases.exclude(query)
        
    # Filter by Search
    if request.POST.get('search[value]'):
        query = Q()
        search = request.POST['search[value]']
        query.add(Q(title__icontains=search), Q.OR)
        query.add(Q(tags__tag__name__icontains=search), Q.OR)
        query.add(Q(credits__entity__name__icontains=search), Q.OR)
        releases = releases.filter(query)
        
    # Make sure items aren't repeated
    releases = releases.distinct()

Not sure I understand your use case, but if the problem is that you get duplicates in your aggregation, maybe you can try to use distinct on your string aggregations:

https://docs.djangoproject.com/fr/5.1/ref/contrib/postgres/aggregates/#stringagg

Below is the filter I'm applying that affects the aggregation (applied before StringAgg):

The filter looks at (other) one-to-many or many-to-many relations. This means that this will result in LEFT OUTER JOIN that will start to act as a multiplier. The same works with other aggregates such as SUM and COUNT.

Indeed, if you just aggregate, the query looks like:

SELECT string_agg(CASE
                      WHEN c.type = 'primary' THEN e.name
                      ELSE NULL
                  END
                  ORDER BY e.id)
FROM releases AS r
LEFT OUTER JOIN credits AS c ON c.release_id = r.id
LEFT OUTER JOIN entity AS e ON e.credit_id = e.id

and that is fine, because we want to aggregate over the entities, so per release we get all entities.

But now if we for example filter on the tags, it looks like:

SELECT string_agg(CASE
                      WHEN c.type = 'primary' THEN e.name
                      ELSE NULL
                  END
                  ORDER BY e.id)
FROM releases AS r
LEFT OUTER JOIN credits AS c ON c.release_id = r.id
LEFT OUTER JOIN entity AS e ON e.credit_id = e.id
LEFT OUTER JOIN release_tag rt1 ON rt1.release_id = r.id
LEFT OUTER JOIN tag t1 ON rt1.tag_id = t1.id
LEFT OUTER JOIN release_tag rt2 ON rt2.release_id = r.id
LEFT OUTER JOIN tag t1 ON rt2.tag_id = t2.id
WHERE t1.name = 'tag1'
  OR t2.name = 'tag2'

This means that the entity names for a release where one of the two tags match will be repeated once, but where both match, it will be included twice.

The Q(credits__entity__name__icontains) makes it even worse: since this will join on the table c, and thus entities of credits that have two or more matches will be included that many times, whereas the entities of a release that does not match with a given credit, will no longer be included.

The least common divider of all these phenomena is probably that you should be very careful to make aggregates when you make JOINs especially if multiple tables on which the JOINs are caried our are involved.

What you probably can do is make a Subquery expression [Django-doc] which is thus not impacted by the "outer" tables and the corresponding JOIN.

Вернуться на верх