Batching requests in Scrapy
A website scraping process may require thousands of HTTP requests. Say you have a long list of addresses, or you generate lots of requests with some linear process. If you just fire all requests as they are generated, you'll end up with hundreds of requests waiting to be processed. The framework queues them internally and performs them according to the CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN limits. A long time can then pass between a request being fired and it actually being processed, which makes it difficult to track progress and debug errors.
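
For reference, both limits are ordinary Scrapy settings; the snippet below simply restates their default values (16 and 8) as you would put them in settings.py:

# settings.py
CONCURRENT_REQUESTS = 16            # max concurrent requests across the whole crawl (default)
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # max concurrent requests per single domain (default)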

Another way is to fire a fixed batch of requests, say 10, wait until they are all done, then fire the next 10, and so on. This is also useful if you need some additional actions related to your requests (for example, extracting videos with an external utility) to finish before the next batch.
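
Splitting the work into batches is plain Python; the helper below is only an illustration (the batches name and the batch size of 10 are arbitrary):

def batches(urls, size=10):
    # yield consecutive slices of at most `size` URLs
    for start in range(0, len(urls), size):
        yield urls[start:start + size]

You would fire the first slice from start_requests() and hand out the remaining slices one by one from the idle handler shown below.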

For this you use the spider_idle signal. The signal is sent when all requests are done and all items in the pipelines have been processed. So you first fire your 10 requests, then handle this signal, then continue requesting as in the code below. You can also display some diagnostics here, wait for other processes - whatever you want between the two request batches.

import scrapy


class Spider(scrapy.Spider):
    name = 'batch_example'  # any spider name

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # connect the handler to the spider_idle signal
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider._spider_idle, signal=scrapy.signals.spider_idle)
        return spider

    def start_requests(self):
        requests = ...  # make your requests (the first batch)
        return requests

    def _spider_idle(self, spider):
        # requests are done now: log/display a results summary here,
        # wait for other threads/processes to finalize (if any),
        # then schedule the next batch (note: recent Scrapy versions
        # accept only the request argument here)
        self.crawler.engine.crawl(scrapy.Request(...), spider)


This is especially useful when you get a number of links from a page, request them all, process the results with pipelines, and then continue to the next page. In that case you pass the next page URL to the crawl() call, as in the sketch below.
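
Here is a minimal sketch of that page-by-page pattern. The spider name, start URL, CSS selectors and the next_page_url attribute are made up for illustration and are not part of Scrapy's API:

import scrapy


class PagedSpider(scrapy.Spider):
    name = 'paged_example'                                # hypothetical name
    start_urls = ['https://example.com/catalog?page=1']   # hypothetical start URL

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider._spider_idle, signal=scrapy.signals.spider_idle)
        return spider

    def parse(self, response):
        # request every item link on the current page - this is one batch
        for href in response.css('a.item::attr(href)').getall():
            yield response.follow(href, callback=self.parse_item)
        # remember the next page; it is requested only when the whole batch is done
        next_href = response.css('a.next::attr(href)').get()
        self.next_page_url = response.urljoin(next_href) if next_href else None

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}

    def _spider_idle(self, spider):
        # the batch is done and the pipelines have finished; move to the next page
        if getattr(self, 'next_page_url', None):
            self.crawler.engine.crawl(
                scrapy.Request(self.next_page_url, callback=self.parse), spider)
            self.next_page_url = None

Each page's links form one batch, and _spider_idle fires only once those requests and their pipeline work are finished, so the next page is requested only after the current one has been fully processed.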