Django Strategies for processing large datasets from an external API without a local replica table

I am building an SMS gateway system in Django that acts as an interface between university data sources (e.g., student information systems) and an SMS provider.

The Architecture: To ensure data freshness and avoid redundancy, we decided not to maintain a local Recipient table (mirroring the external database). Instead, we use a Stateless/Proxy architecture:

  1. User selects filters (e.g., "Faculty: Engineering") in the frontend.

  2. Backend fetches the student list in real-time from the external API.

  3. We iterate through the results and create a "Snapshot" log (SmsDispatch model) for history/reporting (a rough model sketch follows this list).

  4. We send the SMS.
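For reference, the snapshot model is roughly this shape (a sketch only; the field names here are placeholders rather than the real schema):

```python
# models.py -- rough sketch of the snapshot log; field names are placeholders
from django.db import models


class SmsDispatch(models.Model):
    campaign_id = models.UUIDField(db_index=True)            # groups one send run
    student_external_id = models.CharField(max_length=64)    # ID from the external API
    phone_number = models.CharField(max_length=20)
    message = models.TextField()
    status = models.CharField(max_length=20, default="pending")  # pending/sent/failed
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        # One row per recipient per campaign; also useful for idempotent retries.
        constraints = [
            models.UniqueConstraint(
                fields=["campaign_id", "student_external_id"],
                name="uniq_dispatch_per_campaign",
            )
        ]
```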

The Problem: The external API might return large datasets (e.g., 20,000+ students). I am concerned about:

  1. Memory (OOM): Loading 20k objects into a Python list before saving to the DB.

  2. Timeouts: The external API being slow, causing the HTTP request or the Celery task to hang.

The AI-suggested approach: use Celery with Python generators and Django's bulk_create to stream the data from the external API and write it to the database in batches.
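Roughly what that would look like (a minimal sketch only; the API URL, pagination scheme, and response/field names are assumptions on my part):

```python
# tasks.py -- sketch of the "generator + batch insert" idea.
# The API URL, pagination parameters, and response shape are assumptions.
import requests
from celery import shared_task

from .models import SmsDispatch

BATCH_SIZE = 500


def iter_students(filters, page_size=500):
    """Lazily yield students one page at a time from the external API."""
    page = 1
    while True:
        resp = requests.get(
            "https://university.example.com/api/students",  # placeholder URL
            params={**filters, "page": page, "page_size": page_size},
            timeout=30,  # keep a slow upstream from hanging the task forever
        )
        resp.raise_for_status()
        results = resp.json()["results"]
        if not results:
            return
        yield from results
        page += 1


@shared_task
def dispatch_campaign(campaign_id, filters, message):
    """Write SmsDispatch rows in fixed-size batches instead of one 20k list."""
    batch = []
    for student in iter_students(filters):
        batch.append(
            SmsDispatch(
                campaign_id=campaign_id,
                phone_number=student["phone"],
                student_external_id=student["id"],
                message=message,
            )
        )
        if len(batch) >= BATCH_SIZE:
            SmsDispatch.objects.bulk_create(batch, ignore_conflicts=True)
            batch.clear()
    if batch:
        SmsDispatch.objects.bulk_create(batch, ignore_conflicts=True)
```

The intent is that only BATCH_SIZE objects are ever held in memory at once, and ignore_conflicts together with the unique constraint above is meant to keep re-runs from inserting duplicate snapshot rows.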

My questions:

  1. Is the choice not to store the data locally a sound approach?

  2. Is this "Generator + Batch Insert" pattern the standard way to handle this in Django/Celery to prevent OOM errors?

  3. How should I handle partial failures? If the external API fails on page 5 of 10, how do I resume without duplicating the SMS for the first 4 pages? (A sketch of what I'm considering follows this list.)

  4. Is the decision to NOT save recipients locally a viable strategy for this scale (20k+ users), or is the reliance on an external API for every campaign considered an anti-pattern in production SMS systems?
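For question 3, this is the kind of per-page checkpointing I was considering (purely a sketch; CampaignCheckpoint, fetch_student_page, and the retry policy are my assumptions, not existing code):

```python
# Hypothetical sketch: persist a per-campaign checkpoint so a retried task
# skips pages that were already written. Model and helper names are assumptions.
import requests
from celery import shared_task
from django.db import models

from .models import SmsDispatch


class CampaignCheckpoint(models.Model):
    # Would live in models.py; shown inline for completeness.
    campaign_id = models.UUIDField(unique=True)
    last_completed_page = models.IntegerField(default=0)


def fetch_student_page(filters, page, page_size=500):
    """Hypothetical helper: fetch a single page from the external API."""
    resp = requests.get(
        "https://university.example.com/api/students",  # placeholder URL
        params={**filters, "page": page, "page_size": page_size},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]


@shared_task(bind=True, max_retries=5, default_retry_delay=60)
def dispatch_campaign_resumable(self, campaign_id, filters, message):
    checkpoint, _ = CampaignCheckpoint.objects.get_or_create(campaign_id=campaign_id)
    page = checkpoint.last_completed_page + 1
    while True:
        try:
            students = fetch_student_page(filters, page)
        except requests.RequestException as exc:
            # Retry the whole task; pages already recorded are skipped next time.
            raise self.retry(exc=exc)
        if not students:
            break
        SmsDispatch.objects.bulk_create(
            [
                SmsDispatch(
                    campaign_id=campaign_id,
                    phone_number=s["phone"],
                    student_external_id=s["id"],
                    message=message,
                )
                for s in students
            ],
            ignore_conflicts=True,  # unique constraint keeps re-runs idempotent
        )
        # Advance the checkpoint only after the page is fully persisted.
        checkpoint.last_completed_page = page
        checkpoint.save(update_fields=["last_completed_page"])
        page += 1
```

The actual SMS sending would then work off SmsDispatch rows still marked "pending", so a resumed run should not re-send messages that already went out.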
