How can I pass the Scrapy response from a spider to a pipeline?

I need the whole Scrapy response, with its settings, pipelines, URLs and everything else, inside the pipeline where I create my model objects. Is there a way to get it there?

pipeline.py


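from datetime import date

from celery.utils.log import get_task_logger  # assumption: the Celery logger helper
# Mail and User are the project's Django models; import them from your own app.

class ScraperPipeline(object):
    def process_item(self, item, spider):
        logger = get_task_logger("logs")
        logger.info("Pipeline activated.")
        # The spider stores each field as a one-element tuple, hence the [0].
        id = item['id'][0]
        user = item['user'][0]
        text = item['text'][0]
        # get_or_create returns (object, created), so unpack the object first.
        user_obj, _ = User.objects.get_or_create(id=id, user=user)
        # Assumption: the undefined `today` was meant to be the current date.
        Mail.objects.create(user=user_obj, text=text, date=date.today())
        logger.info("Pipeline deactivated.")
        # Return the item so that later pipelines still receive it.
        return item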

spider.py
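import scrapy
from scrapy.spiders import CrawlSpider
# MailItem is the project's own Item class; import it from your items module.


class Spider(CrawlSpider):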
    name = 'spider'
    allowed_domains = ['xxx.com']

    def start_requests(self):
        urls = [
            'xxx.com',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse,
                                 headers={'User-Agent':
                                              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, '
                                              'like Gecko) Chrome/107.0.0.0 Safari/537.36'})

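    def parse(self, response):
        for row in response.xpath('xpath thins'):
            # Create a fresh item per row (the original assigned to an undefined `ip`).
            item = MailItem()
            # The trailing commas make each value a one-element tuple,
            # which is why the pipeline reads item['id'][0].
            item['id'] = row.xpath('td[1]//text()').extract_first(),
            item['user'] = row.xpath('td[2]//text()').extract_first(),
            item['text'] = row.xpath('td[3]//text()').extract_first(),
            yield item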

I tried to access the response from the pipeline, but all I get there is the item. The object I create is also missing some data. Any ideas?

If you need access to the response or the request in your pipeline, you can pass the full response along with the item from your callback methods.

For example:

class SpiderClass(scrapy.Spider):
    ...
    ...

    def parse(self, response):
        for i in response.xpath(...):
            field1 = ...
            yield {'field1': field1, 'response': response}

In your pipeline, the response is then available as a field of the item inside the process_item method. You can also reach the settings from that method through the crawler attribute of the spider argument.

For example:

class MyPipeline:

    def process_item(self, item, spider):
        response = item['response']
        request = response.request
        settings = spider.crawler.settings
        ...  # do something
        del item['response']
        return item
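
To try the idea without running a crawl, here is a minimal sketch that feeds a pipeline with stand-in objects. The class name ResponseAwarePipeline, the added fields (source_url, request_method, bot_name) and the fake response and spider are all hypothetical and exist only for illustration; in a real project Scrapy supplies the actual response and spider.

```python
from types import SimpleNamespace


class ResponseAwarePipeline:
    """Reads the response carried on the item, then strips it before export."""

    def process_item(self, item, spider):
        # Remove the response so it is not exported with the scraped data.
        response = item.pop("response")
        # Keep only lightweight details taken from the response and settings.
        item["source_url"] = response.url
        item["request_method"] = response.request.method
        item["bot_name"] = spider.crawler.settings.get("BOT_NAME")
        return item


# Stand-ins (hypothetical) that mimic only the attributes used above:
# response.url, response.request.method and spider.crawler.settings.
fake_response = SimpleNamespace(
    url="https://example.com/page",
    request=SimpleNamespace(method="GET"),
)
fake_spider = SimpleNamespace(
    crawler=SimpleNamespace(settings={"BOT_NAME": "demo"})
)

result = ResponseAwarePipeline().process_item(
    {"field1": "value", "response": fake_response}, fake_spider
)
print(result)
```

Copying the few values you need and dropping the response keeps the exported item small, since a response object holds the whole page body.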

After that, you just need to enable the pipeline in your settings.
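
Enabling a pipeline is typically done through the ITEM_PIPELINES setting. The sketch below assumes a project package called myproject with the pipeline defined in pipelines.py; both names are placeholders, so substitute your own module path.

```python
# settings.py
# Hypothetical path: replace "myproject.pipelines" with your own package.
# The integer sets the run order; pipelines with lower values run first.
ITEM_PIPELINES = {
    "myproject.pipelines.MyPipeline": 300,
}
```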
