Why does duplicating a Django filter condition change PostgreSQL’s query plan estimates?

I'm working on a Django project with PostgreSQL and have implemented a custom filter to select "active" records. The filter is defined as follows:

from django.db.models import Q
from django.utils import timezone


def get_custom_filter(qs):
    # "Active" records: any one of the three time/flag combinations may match.
    return qs.filter(
        Q(
            Q(timestamp_field__lte=timezone.now(), related_obj__delay__isnull=True)
            | Q(
                related_obj__delay__lte=timezone.now(),
                type_obj__flag=False,
                timestamp_field__isnull=False,
            )
            | Q(type_obj__flag=True, timestamp_field__lte=timezone.now())
        )
    )

I then create a custom filter backend:

from rest_framework import filters


class CustomFilterBackend(filters.BaseFilterBackend):
    def filter_queryset(self, request, qs, view):
        # Every time this backend runs, it applies the same conditions again.
        return get_custom_filter(qs)

In my view, I include this filter in the filter_backends list.
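For context, here is a minimal sketch of the view; the model, serializer, and view names are placeholders rather than the real ones:

from rest_framework import generics


class MyModelListView(generics.ListAPIView):
    # MyModel and MyModelSerializer are hypothetical stand-ins.
    queryset = MyModel.objects.all()
    serializer_class = MyModelSerializer
    # Listing CustomFilterBackend once applies the filter once;
    # listing it twice duplicates the WHERE conditions shown below.
    filter_backends = [CustomFilterBackend]

With that in place, here is what happens: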

When I include CustomFilterBackend only once, the PostgreSQL query plan estimates around 7,200 rows feeding the subsequent sort/unique steps after the filter is applied. The WHERE clause looks like this:

WHERE ("tenant_table"."tenant_id" IN (1)
       AND (("main_table"."delay_field" IS NULL
             AND "primary_table"."timestamp_field" <= TIMESTAMP)
            OR (NOT "type_table"."flag"
                AND "main_table"."delay_field" <= TIMESTAMP
                AND "primary_table"."timestamp_field" IS NOT NULL)
            OR ("type_table"."flag"
                AND "primary_table"."timestamp_field" <= TIMESTAMP))
       AND NOT ( ... ))

When I duplicate the filter (include it twice), the estimate drops to about 2,400 rows, roughly one third of the original. The WHERE clause now contains two copies of the active-filter condition:

WHERE ("tenant_table"."tenant_id" IN (1)
       AND (("main_table"."delay_field" IS NULL
             AND "primary_table"."timestamp_field" <= TIMESTAMP)
            OR (NOT "type_table"."flag"
                AND "main_table"."delay_field" <= TIMESTAMP
                AND "primary_table"."timestamp_field" IS NOT NULL)
            OR ("type_table"."flag"
                AND "primary_table"."timestamp_field" <= TIMESTAMP))
       AND (("main_table"."delay_field" IS NULL
             AND "primary_table"."timestamp_field" <= TIMESTAMP)
            OR (NOT "type_table"."flag"
                AND "main_table"."delay_field" <= TIMESTAMP
                AND "primary_table"."timestamp_field" IS NOT NULL)
            OR ("type_table"."flag"
                AND "primary_table"."timestamp_field" <= TIMESTAMP))
       AND NOT ( ... ))
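For reference, the same effect can be reproduced directly from Django: applying the filter function twice is equivalent to listing the backend twice. This is only a sketch, assuming Django 2.1+ (for QuerySet.explain()) and a placeholder model name:

# MyModel is a hypothetical stand-in for the real model.
base_qs = MyModel.objects.all()

once = get_custom_filter(base_qs)
twice = get_custom_filter(get_custom_filter(base_qs))

# EXPLAIN (ANALYZE) output shows the row estimates discussed here:
# ~7,200 rows with one copy of the condition, ~2,400 with two.
print(once.explain(analyze=True))
print(twice.explain(analyze=True))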

Logically, repeating the same condition (i.e. X AND X) should not change the result set. However, PostgreSQL's planner appears to treat the two copies as independent predicates and multiplies their selectivities: if one application of the filter is estimated to pass 50% of rows, duplicating it would give an estimate of 0.5 × 0.5 = 25%. In my case the estimate falls from roughly 7,200 to roughly 2,400 rows, consistent with a per-copy selectivity of about one third. This drastically reduces the estimated number of rows for later operations (such as sorting and uniquing) and leads to a lower overall cost estimate.
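A quick back-of-the-envelope check of that interpretation; the two row counts are the estimates observed above, everything else follows from the multiplication assumption:

# Observed planner estimates (approximate).
rows_one_copy = 7200    # filter applied once
rows_two_copies = 2400  # filter applied twice

# If the second copy is treated as an independent predicate with the same
# selectivity, the ratio of the two estimates is that selectivity.
selectivity = rows_two_copies / rows_one_copy     # ~0.33
rows_before_filter = rows_one_copy / selectivity  # ~21,600 rows fed into the filter

print(f"implied selectivity: {selectivity:.2f}")
print(f"implied input rows:  {rows_before_filter:.0f}")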

My questions are:

  1. Is it correct that PostgreSQL multiplies selectivity when the same condition appears twice in the WHERE clause?
  2. How does PostgreSQL calculate selectivity in these cases?
  3. Is there a recommended way to force the optimizer to obtain the same lower row estimate without duplicating the filter condition in the code?