How to update large amount of records in chunks?

I have models:

class Partner(models.Model):
    sap_code = models.CharField(max_length=64, null=True, blank=True, verbose_name='sap id', default=uuid4)
    # another fields


class DeficienciesAct(models.Model):
    partner = models.ForeignKey('partners.Partner', null=True, on_delete=models.CASCADE)
    sap_code = models.CharField(max_length=64, null=True)
    # another fields

Data:

  • Partner:
id sap_code
1 123
2 124
... ...
  • DeficienciesAct:
id partner_id sap_code
1 null 123
2 null 333
... ... ...
500000 null 421

Let's imagine that in DeficienciesAct there are 500 000 records. Not all DeficienciesAct.sap_code can be in Partner.sap_code.

So I need to update field DeficienciesAct.partner_id by DeficienciesAct.sap_code. DeficienciesAct.sap_code is equal to Partner.sap_code. For that I developed a script:

DeficienciesAct.objects.filter(
        partner__isnull=True, sap_code__in=Subquery(Partner.objects.values_list("sap_code"))
    ).annotate(
        new_partner_id=Subquery(
            Partner.objects.filter(
                sap_code=OuterRef('sap_code')
            ).values('id')[:1]
        )
    ).update(partner_id=F("new_partner_id"))

Well that works. But I'm afraid that huge amount of records can affect a database (postgres).

Is there any way to do the task in chunks?

Please don't offer change models/tables.

Add an index on the sap_code field with db_index=True [Django-doc], like:

class Partner(models.Model):
    sap_code = models.CharField(
        max_length=64,
        null=True,
        blank=True,
        verbose_name='sap id',
        default=uuid4,
        db_index=True,
    )
    # another fields


class DeficienciesAct(models.Model):
    partner = models.ForeignKey(
        'partners.Partner', null=True, on_delete=models.CASCADE
    )
    sap_code = models.CharField(max_length=64, null=True)

this will boost lookups for a given sap_code.

But if sap_code is a unique field for a ForeignKey, we can just use that as value for the ForeignKey, with:

class Partner(models.Model):
    sap_code = models.CharField(
        max_length=64,
        verbose_name='sap id',
        default=uuid4,
        db_index=True,
        unique=True,
    )


class DeficienciesAct(models.Model):
    partner = models.ForeignKey(
        'partners.Partner', null=True, on_delete=models.CASCADE
    )
    sap_code = models.CharField(max_length=64, null=True)
    new_partner = models.ForeignKey(
        'partners.Partner', null=True, to_field='sap_code'
    )

We can then work with:

DeficienciesAct.objects.update(new_partner_id=F('sap_code'))

and eventually thus get rid of the "old" partner field, and rename new_parter to Partner and get rid of the sap_code field in DeficienciesAct. This because Django then uses the sap_code field to refer to a Partner.

That worked:

batch_size = 1000
deficiency_acts_for_update = DeficienciesAct.objects.filter(
    partner__isnull=True, sap_code__in=Subquery(Partner.objects.values_list("sap_code"))
).order_by('id')
while True:
    if not deficiency_acts_for_update.count():
        break
    deficiency_acts_for_update.filter(
        id__in=Subquery(deficiency_acts_for_update.values_list("id")[:batch_size])
    ).annotate(
        _partner_id=Subquery(
            Partner.objects.filter(
                sap_code=OuterRef('sap_code')
            ).values('id')[:1]
        )
    ).update(partner_id=F("_partner_id"))
Вернуться на верх