How to update large amount of records in chunks?
I have models:
class Partner(models.Model):
sap_code = models.CharField(max_length=64, null=True, blank=True, verbose_name='sap id', default=uuid4)
# another fields
class DeficienciesAct(models.Model):
partner = models.ForeignKey('partners.Partner', null=True, on_delete=models.CASCADE)
sap_code = models.CharField(max_length=64, null=True)
# another fields
Data:
- Partner:
id | sap_code |
---|---|
1 | 123 |
2 | 124 |
... | ... |
- DeficienciesAct:
id | partner_id | sap_code |
---|---|---|
1 | null | 123 |
2 | null | 333 |
... | ... | ... |
500000 | null | 421 |
Let's imagine that in DeficienciesAct there are 500 000 records. Not all DeficienciesAct.sap_code can be in Partner.sap_code.
So I need to update field DeficienciesAct.partner_id by DeficienciesAct.sap_code. DeficienciesAct.sap_code is equal to Partner.sap_code. For that I developed a script:
DeficienciesAct.objects.filter(
partner__isnull=True, sap_code__in=Subquery(Partner.objects.values_list("sap_code"))
).annotate(
new_partner_id=Subquery(
Partner.objects.filter(
sap_code=OuterRef('sap_code')
).values('id')[:1]
)
).update(partner_id=F("new_partner_id"))
Well that works. But I'm afraid that huge amount of records can affect a database (postgres).
Is there any way to do the task in chunks?
Please don't offer change models/tables.
Add an index on the sap_code
field with db_index=True
[Django-doc], like:
class Partner(models.Model):
sap_code = models.CharField(
max_length=64,
null=True,
blank=True,
verbose_name='sap id',
default=uuid4,
db_index=True,
)
# another fields
class DeficienciesAct(models.Model):
partner = models.ForeignKey(
'partners.Partner', null=True, on_delete=models.CASCADE
)
sap_code = models.CharField(max_length=64, null=True)
this will boost lookups for a given sap_code
.
But if sap_code
is a unique field for a ForeignKey
, we can just use that as value for the ForeignKey
, with:
class Partner(models.Model):
sap_code = models.CharField(
max_length=64,
verbose_name='sap id',
default=uuid4,
db_index=True,
unique=True,
)
class DeficienciesAct(models.Model):
partner = models.ForeignKey(
'partners.Partner', null=True, on_delete=models.CASCADE
)
sap_code = models.CharField(max_length=64, null=True)
new_partner = models.ForeignKey(
'partners.Partner', null=True, to_field='sap_code'
)
We can then work with:
DeficienciesAct.objects.update(new_partner_id=F('sap_code'))
and eventually thus get rid of the "old" partner
field, and rename new_parter
to Partner
and get rid of the sap_code
field in DeficienciesAct
. This because Django then uses the sap_code
field to refer to a Partner
.
That worked:
batch_size = 1000
deficiency_acts_for_update = DeficienciesAct.objects.filter(
partner__isnull=True, sap_code__in=Subquery(Partner.objects.values_list("sap_code"))
).order_by('id')
while True:
if not deficiency_acts_for_update.count():
break
deficiency_acts_for_update.filter(
id__in=Subquery(deficiency_acts_for_update.values_list("id")[:batch_size])
).annotate(
_partner_id=Subquery(
Partner.objects.filter(
sap_code=OuterRef('sap_code')
).values('id')[:1]
)
).update(partner_id=F("_partner_id"))