Update Parameter for Web Scraping With Infinite Scroll

I am unsure how to structure my code so that the offset parameter updates each time parse schedules another request back to itself. More detail about my script and the problem I'm trying to solve is in my earlier question, "Scraping Website With Infinite Scroll Using Scrapy". I feel like there is an easy fix that I'm missing.

import scrapy
import json
import requests

class LetgoSpider(scrapy.Spider):
    name = 'letgo'
    allowed_domains = ['letgo.com/en']
    start_urls = ['https://search-products-pwa.letgo.com/api/products?country_code=US&offset=0&quadkey=0320030123201&num_results=50&distance_type=mi']

    def parse(self, response):
        data = json.loads(response.text)
        for used_item in data:
            if len(data) == 0:
                break
            try:
                title = used_item['name']
                price = used_item['price']
                description = used_item['description']
                date = used_item['updated_at']
                images = [img['url'] for img in used_item['images']]
                latitude = used_item['geo']['lat']
                longitude = used_item['geo']['lng']               
            except Exception:
                pass

        yield {'Title': title,
               'Price': price,
               'Description': description,
               'Date': date,
               'Images': images,
               'Latitude': latitude,
               'Longitude': longitude          
               }    

        i = 0
        for new_items_load in response:
            i += 50 
            offset = i
            new_request = 'https://search-products-pwa.letgo.com/api/products?country_code=US&offset=' + str(i) + \
                          '&quadkey=0320030123201&num_results=50&distance_type=mi'
            yield scrapy.Request(new_request, callback=self.parse)

1 answer

  • answered 2018-02-13 01:03 BUZZY

    Define offset as a class attribute:

    class LetgoSpider(scrapy.Spider):
        name = 'letgo'
        allowed_domains = ['letgo.com']  # domain only; with a path here, follow-up requests get filtered as offsite
        start_urls = ['https://search-products-pwa.letgo.com/api/products?country_code=US&offset=0&quadkey=0320030123201&num_results=50&distance_type=mi']
        offset = 0  # <- here
    

    Then you can refer to it as self.offset, and the value will be shared across all invocations of parse. So it would be something like this:

    self.offset += 50
    new_request = ('https://search-products-pwa.letgo.com/api/products?country_code=US&offset='
                   + str(self.offset)
                   + '&quadkey=0320030123201&num_results=50&distance_type=mi')
    yield scrapy.Request(new_request, callback=self.parse)
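
    The offset bookkeeping can also be tried outside Scrapy. Below is a minimal sketch, assuming the API returns an empty JSON list once there are no more items (the question's len(data) == 0 check suggests this, but I have not verified it). OffsetPager, build_url and next_url are hypothetical names made up for illustration; they are not part of Scrapy.

```python
from urllib.parse import urlencode


class OffsetPager:
    """Hypothetical helper that mimics how a spider instance keeps its offset."""

    api_url = 'https://search-products-pwa.letgo.com/api/products'
    page_size = 50  # same value as num_results in the question's URL
    offset = 0      # class-level default, as in the snippet above

    def build_url(self, offset):
        """Return the API URL for the page that starts at `offset`."""
        params = {
            'country_code': 'US',
            'offset': offset,
            'quadkey': '0320030123201',
            'num_results': self.page_size,
            'distance_type': 'mi',
        }
        return self.api_url + '?' + urlencode(params)

    def next_url(self, items):
        """Advance the offset and return the next URL, or None on an empty page."""
        if not items:
            # Assumed stop condition: an empty page means nothing is left.
            return None
        # Assigning through self stores the new value on this instance,
        # so it persists between calls on the same object.
        self.offset += self.page_size
        return self.build_url(self.offset)
```

    In a spider, the same logic would sit at the end of parse: compute the next URL from the decoded data, and only yield scrapy.Request(next_url, callback=self.parse) when it is not None. Without a stop condition like this, the spider keeps requesting empty pages indefinitely.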