What order does a Scrapy CrawlSpider follow when scraping pages?

I am new to Scrapy and am reading Learning Scrapy to study it, and I have a question about the scraping order.

The book provides this piece of code:

rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
        callback='parse_item')
)

It also says that Scrapy uses a LIFO strategy to crawl. So I assumed that the first item scraped would be one from the last page, but it turns out the first item is from the first page.

Why? Based on the code, I thought Scrapy would keep following the first rule until it finds the last page, and only then start parsing items on the last page. I am confused.

And if a website has millions of pages, will Scrapy not parse any items until it reaches the last page?

1 answer

  • answered 2018-05-16 10:49 Granitosaurus

    All of the rules are applied on every page, in the order they appear in the tuple.

    For example you have two rules:

    1. find other pagination pages (no callback)
    2. find products (with callback)

    If you run this spider on the first page, it will find the other pagination URLs and schedule them, then find the products and schedule them with the parse_product callback (or whatever callback you have set). Afterwards, for every scheduled URL that uses the default callback (that is, where you haven't specified a callback argument), it repeats this process until no new links are found.
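
To make the scheduling order concrete, here is a small pure-Python simulation of the behaviour described above. It is only a sketch: it does not use Scrapy, the three-page `SITE` dictionary and the `crawl` function are made up for illustration, and it assumes requests are handled strictly one at a time from a single LIFO stack. Real Scrapy downloads several requests concurrently, so the exact order can differ, but the general pattern should be the same.

```python
# Hypothetical paginated site: each page has a "next" link and some item links.
SITE = {
    "page1": {"next": "page2", "items": ["item1a", "item1b"]},
    "page2": {"next": "page3", "items": ["item2a", "item2b"]},
    "page3": {"next": None, "items": ["item3a", "item3b"]},
}


def crawl(start):
    """Simulate applying both rules to every listing page with a LIFO stack."""
    stack = [(start, None)]  # (url, callback); None means "default callback"
    parsed = []              # order in which items reach parse_item

    while stack:
        url, callback = stack.pop()  # LIFO: take the most recently scheduled

        if callback == "parse_item":
            parsed.append(url)       # item pages are parsed, not followed
            continue

        page = SITE[url]
        # Rule 1: schedule the pagination link (no callback, so keep following).
        if page["next"] is not None:
            stack.append((page["next"], None))
        # Rule 2: schedule every item link with the parse_item callback.
        for item in page["items"]:
            stack.append((item, "parse_item"))

    return parsed


order = crawl("page1")
print(order)
```

In this model the item requests from page 1 are scheduled after the request for page 2, so a LIFO stack pops them first. That is why the first parsed item comes from the first page, and why items are parsed throughout the crawl rather than only after the last page has been reached.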