Downloading Files in Scrapy. MediaPipeline.
In the Scraping Twitter Video article we performed four consecutive requests, each following the same pattern: request a page, apply some XPath to the result, compose the link to the next page. Doing this in a spider would require four callbacks, each composing a new scrapy.Request object, with lots of duplicated code. Given that extracting a video may be only one part of a larger process, I don't like having this code in a spider. I'm going to demonstrate how to do it in a pipeline, reusing as much code as possible: first how to use a pipeline for a single request, then how to subclass it to fire several request chains for Twitter video scraping in parallel.

The Scrapy item pipeline that is handy for making a single request outside of a spider is the undocumented MediaPipeline. There are other, documented pipelines (FilesPipeline, ImagesPipeline) for file downloads, but they do more than we need in this case, which is quite common. Using an undocumented pipeline is not entirely safe, so if you know a better way of doing this, please write in the comments.

Part 1. Using MediaPipeline for a single download

MediaPipeline is used as follows. Suppose you want to download a single video file per item, and you extract the video URL into the item['video_url'] field:
-you return a request with this URL from get_media_requests()
-media_downloaded() is called on success, with the response object
-media_failed() is called on failure
-don't forget to activate your pipeline in settings.py (a sketch is at the end of this section)

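For context, here is a minimal spider sketch that produces such an item. The start URL, XPaths, and field values are hypothetical placeholders; only the shape of the item (video_url plus some identifying name) matters:

import scrapy

class VideoSpider(scrapy.Spider):
    name = "videos"
    # placeholder URL and selectors, for illustration only
    start_urls = ["https://example.com/videos"]

    def parse(self, response):
        for entry in response.xpath("//div[@class='video']"):
            yield {
                "name": entry.xpath("./@data-name").get(),
                "video_url": entry.xpath("./a/@href").get(),
            }

The pipeline below picks up each such item: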
import os.path
import scrapy
from scrapy.pipelines.media import MediaPipeline

ITEMS_DIR = "./downloads"

class MyFilePipeline(MediaPipeline):

    def get_media_requests(self, item, info):
        if item['video_url']:
            video_url = item['video_url']
            print("MyFilePipeline downloading %s" % video_url)
            request = scrapy.Request(
                url=video_url,
                method="GET",
                headers={
                    "Accept": "*/*",
                    "User-Agent": "Mozilla",
                },
                # keep a reference to the item for the download callbacks
                meta={"item": item},
            )
            return request

    def media_downloaded(self, response, request, info):
        # the file name is the last component of the URL
        (vpath, vname) = os.path.split(request.url)
        item = response.meta['item']
        with open(os.path.join(ITEMS_DIR, vname), "wb") as f:
            f.write(response.body)
        print("MyFilePipeline download complete %s for %s" % (request.url, item['name']))

    def media_failed(self, failure, request, info):
        item = request.meta['item']
        print("MyFilePipeline download failed %s for %s" % (request.url, item['name']))


Details:
-The print() in get_media_requests: some output is most probably needed, because MediaPipeline does not provide any. Replace it with logging at an appropriate level.
-meta={"item": item}: meta is used to pass request-related data to the download callbacks. Saving a reference to the whole item is the most common way.
-return request: you can also yield as many requests as you want (see the sketch after this list).
-os.path.split(request.url): extracts the file name (vname) from the request URL.
-open(os.path.join(ITEMS_DIR, vname), "wb"): saves the download result to the 'downloads' directory, which should exist in the current directory. In real life I usually have a field like item['path'], filled in by a custom filesystem pipeline. Don't forget to pass "b" to open() for binary mode, to write exactly what is received from the network.
-item['name']: you may need some field to identify an item; item['name'] is taken as an example. It is not required, but you will most likely need it if an item has more than one file to download.
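
To illustrate yielding several requests: if an item carried a list of URLs instead of a single one, get_media_requests could yield one request per URL, and the item would be held back until all of them complete. A minimal sketch, assuming a hypothetical item['video_urls'] list field:

import scrapy

class MyMultiFilePipeline(MyFilePipeline):

    def get_media_requests(self, item, info):
        # hypothetical variation: item['video_urls'] holds a list of URLs
        for video_url in item.get('video_urls', []):
            yield scrapy.Request(
                url=video_url,
                headers={"Accept": "*/*", "User-Agent": "Mozilla"},
                meta={"item": item},
            )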

NOTES:
1 An item will not be passed to further pipelines until all requests from get_media_requests have finished, one way or another
2 Requests from other items can be fired in the meantime (while requests from this item are in flight but not yet finished)
3 Duplicate requests are filtered out. I will explain how to override this in a separate article.
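
Finally, the activation mentioned above: a minimal settings.py sketch, assuming the pipeline class lives in myproject/pipelines.py (the "myproject" path is a placeholder for your project layout):

# settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.MyFilePipeline": 300,
}

The priority value 300 is arbitrary; any integer that orders this pipeline relative to your other pipelines will do.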