Scrapy combine items from multiple processes

I have a scrapy script that

  1. Finds all 'pages' nodes in an XML file
  2. Parses all those pages, collects data, finds additional pages
  3. Additional pages are further parsed and information is collected

Scrapy script:

from pprint import pprint

from scrapy import Request
from scrapy.spiders import XMLFeedSpider


class test_spider(XMLFeedSpider):
    name = 'test'
    start_urls = ['https://www.example.com']
    custom_settings = {
        'ITEM_PIPELINES': {
            'test.listings_pipe': 100,
        },
    }
    itertag = 'pages'

    def parse_node(self, response, node):
        # XMLFeedSpider calls parse_node() for each <pages> node
        yield Request(
            'https://www.example.com/' + node.xpath('@id').extract_first() + '/xml-out',
            callback=self.parse2,
        )

    def parse2(self, response):
        yield {'COLLECT1': response.xpath('/@id').extract_first()}
        # `root` is an XPath prefix defined elsewhere
        for text in (response.xpath(root + '/node[@id="page"]/text()').extract_first() or '').split('^'):
            if text:
                yield Request(
                    'https://www.example.com/' + text,
                    callback=self.parse3,
                    dont_filter=True,
                )

    def parse3(self, response):
        yield {'COLLECT2': response.xpath('/@id').extract_first()}


class listings_pipe(object):
    def process_item(self, item, spider):
        pprint(item)
        return item

The ideal result would be a combined dict item such as

{'COLLECT1':'some data','COLLECT2':['some data','some data',...]}

Is there a way to call the pipeline once per top-level page, after all of its sub-pages have been parsed, and get a combined dict of items?

1 answer

  • answered 2019-01-11 06:09 ThunderMind

    In your parse2 method, use `meta` to pass your COLLECT1 data along to parse3. Then in parse3, retrieve COLLECT1 from `response.meta`, extract your COLLECT2, and yield the combined result however you wish.

    For more info on `meta` you can read here
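    The meta-passing approach described above can be sketched like this. It is a minimal adaptation, not the asker's exact spider: it chains the parse3 requests one after another, carrying the partially built item in `request.meta`, so the spider emits a single combined item per top-level page. The XPaths (`//@id`, `//node[@id="page"]/text()`), the `example.com` URLs, and the `pending` meta key are illustrative assumptions.

    ```python
    import scrapy


    class TestSpider(scrapy.Spider):
        name = 'test_meta'

        def parse2(self, response):
            # Start the combined item here; COLLECT2 fills up in parse3.
            item = {
                'COLLECT1': response.xpath('//@id').extract_first(),
                'COLLECT2': [],
            }
            pages = [t for t in (response.xpath('//node[@id="page"]/text()')
                                 .extract_first() or '').split('^') if t]
            if not pages:
                # No sub-pages: the item is already complete.
                yield item
            else:
                # Fetch the first sub-page; remember the rest in meta.
                yield scrapy.Request(
                    'https://www.example.com/' + pages[0],
                    callback=self.parse3,
                    dont_filter=True,
                    meta={'item': item, 'pending': pages[1:]},
                )

        def parse3(self, response):
            # Retrieve the partially built item and append this page's data.
            item = response.meta['item']
            item['COLLECT2'].append(response.xpath('//@id').extract_first())
            pending = response.meta['pending']
            if pending:
                # More sub-pages left: keep the chain going.
                yield scrapy.Request(
                    'https://www.example.com/' + pending[0],
                    callback=self.parse3,
                    dont_filter=True,
                    meta={'item': item, 'pending': pending[1:]},
                )
            else:
                # Last sub-page: emit the fully combined item.
                yield item
    ```

    Chaining the requests sequentially trades some crawl parallelism for a guarantee that exactly one finished item reaches the pipeline per top-level page.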