Django crawler only scrapes the first page

I have a crawler that works fine as a plain Python script, but when I run it inside a Django app it only collects the links from the first page. To start the crawler I run celery -A src worker -l INFO.

Any idea why it starts working and then suddenly stops? Could this be a Django problem?

Edit: here is the project layout:

|-- CrawlerWebsite\ Website
|   |-- crawled.txt
|   `-- queue.txt
|-- celery.log
|-- celerybeat-schedule.db
|-- db.sqlite3
|-- manage.py
|-- requirements.txt
|-- scraper
|   |-- __init__.py
|   |-- admin.py
|   |-- apps.py
|   |-- migrations
|   |   |-- 0001_initial.py
|   |   |-- __init__.py
|   |-- models
|   |   |-- __init__.py
|   |   `-- models.py
|   |-- spyder
|   |   |-- __init__.py
|   |   |-- domain.py
|   |   |-- general.py
|   |   |-- link_finder.py
|   |   `-- spider.py
|   |-- tasks.py
|   |-- tests.py
|   |-- urls.py
|   `-- views.py
|-- src
|   |-- __init__.py
|   |-- asgi.py
|   |-- celery.py
|   |-- settings.py
|   |-- urls.py
|   `-- wsgi.py
`-- templates
    `-- index.html

tasks.py calls the whole spyder package, which contains the crawler itself (split across several modules):

from celery import shared_task
import threading
from queue import Queue
from scraper.spyder import domainTool, fileTool, Spider

from .models.models import Portal, urlList

PROJECT_NAME = 'Website to crawl'
HOMEPAGE = 'https://www.examplesite.com'
DOMAIN_NAME = domainTool.get_domain_name(HOMEPAGE)
QUEUE_FILE = PROJECT_NAME + '/queue.txt'
CRAWLED_FILE = PROJECT_NAME + '/crawled.txt'
NUMBER_OF_THREADS = 8
queue = Queue()
Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)


@shared_task
def create_workers():
    for _ in range(NUMBER_OF_THREADS):
        t = threading.Thread(target=work)
        t.daemon = True
        t.start()


def work():
    while True:
        url = queue.get()
        Spider.crawl_page(threading.current_thread().name, url)
        queue.task_done()


def create_jobs():
    for link in fileTool.file_to_set(QUEUE_FILE):
        queue.put(link)
    queue.join()
    crawl()


@shared_task
def crawl():
    queued_links = fileTool.file_to_set(QUEUE_FILE)
    if len(queued_links) > 0:
        print(str(len(queued_links)) + ' links in the queue')
        create_jobs()
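
For reference, here is a minimal sketch of the worker/queue pattern that tasks.py relies on, stripped of Celery and Django. It is not the project's code: fake_crawl_page is a hypothetical stand-in for Spider.crawl_page, and NUMBER_OF_THREADS is an arbitrary value. It illustrates one thing that may matter here: queue.join() returns only if worker threads consuming that same in-memory queue are running in the same process. In the code above the threads are started by create_workers, which is a separate task from crawl, so they might never be started, or might be started in a different Celery worker process that has its own copy of the queue.

```python
import threading
from queue import Queue

NUMBER_OF_THREADS = 4  # arbitrary value for this sketch


def fake_crawl_page(url, results, lock):
    # Hypothetical stand-in for Spider.crawl_page: records the URL
    # instead of fetching it.
    with lock:
        results.append(url)


def run_crawl(urls):
    """Start the worker threads and process every URL within one call."""
    jobs = Queue()
    results = []
    lock = threading.Lock()

    def work():
        while True:
            url = jobs.get()
            try:
                fake_crawl_page(url, results, lock)
            finally:
                # Always mark the job done, otherwise jobs.join() never returns.
                jobs.task_done()

    # Workers are created in the same process (and the same call) that
    # later waits on the queue.
    for _ in range(NUMBER_OF_THREADS):
        threading.Thread(target=work, daemon=True).start()

    for url in urls:
        jobs.put(url)
    jobs.join()  # blocks until every queued URL has been processed
    return sorted(results)
```

If this is indeed the cause, one option would be to start the threads from inside the same task that fills the queue, rather than from a separate task.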
