Scraping Twitter Video
A Twitter video embedded in another site can be scraped in four steps. If you want to scrape directly from Twitter, start with step 2. I will guide you through these steps.

The process is quite straightforward. At each step you make an HTTP request and extract some data from the response. Using this data you compose the next request, and so on, until you get the mp4 video URL at the end.

1) We will take afl.com.au as an example and scrape the first Twitter video from this page: http://www.afl.com.au/news/2016-03-28/after-the-siren-crunch-time-after-one-round-youd-better-believe-it (the video at the bottom of screenshot 1).

(1)


First, get the page source, either by making an HTTP request from your code or by using your browser.

Locate the Twitter link in the page source (https://twitter.com/AFL/status/713646178999480320).
All you need from this link is the Twitter ID number (713646178999480320).

(1.1)


Extract the link from the page source with the following xpath:
//blockquote[@class='twitter-video']/a[re:test(@href, 'status/\d+')]/@href

Result: https://twitter.com/AFL/status/713646178999480320
Details: re:test(@href, 'status/\d+') applies the regular expression status/\d+ to the value of the href attribute. If the value contains a match for the regexp, the node is selected.

Extract the id from the link with the following regexp:
status/(\d+)

Result: 713646178999480320
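
If you prefer to do the ID extraction in plain Python, here is a minimal sketch using only the standard library. The function name extract_twitter_id is my own choice for illustration, and the sketch assumes you already have the link as a string.

```python
import re

# Same pattern as above: capture the digits that follow "status/".
STATUS_ID_RE = re.compile(r"status/(\d+)")


def extract_twitter_id(link):
    """Return the numeric ID from a status link, or None if there is none."""
    match = STATUS_ID_RE.search(link)
    return match.group(1) if match else None


print(extract_twitter_id("https://twitter.com/AFL/status/713646178999480320"))
# prints 713646178999480320
```

The ID is kept as a string, since it is only ever pasted into the next URL.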

2) If you're scraping from the Twitter site directly, take the ID from the page URL (screenshot 2). If you're scraping from AFL or some other site, you already have the Twitter ID from the previous step.

(2)


Next, compose the following link:
https://twitter.com/i/videos/tweet/ + twitter id
Result: https://twitter.com/i/videos/tweet/713646178999480320
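
In code, composing the link is a plain string concatenation. The sketch below adds a small sanity check so that a malformed ID fails early; the names PLAYER_URL_PREFIX and build_player_url are hypothetical and used only for illustration.

```python
# Prefix taken from the step above; the ID is appended to it.
PLAYER_URL_PREFIX = "https://twitter.com/i/videos/tweet/"


def build_player_url(twitter_id):
    """Compose the video player URL for a given Twitter ID."""
    twitter_id = str(twitter_id)
    # Reject anything that is not a non-empty run of ASCII digits.
    if not (twitter_id.isascii() and twitter_id.isdigit()):
        raise ValueError("Twitter ID must consist of digits only: %r" % twitter_id)
    return PLAYER_URL_PREFIX + twitter_id


print(build_player_url("713646178999480320"))
# prints https://twitter.com/i/videos/tweet/713646178999480320
```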

Make an HTTP request to this URL and use the resulting page source for the next step.

3) The response contains player markup like the one in screenshot 3.

(3)


You need the data-config JSON structure.
Extract it with the following xpath:
//div[@class='player-container']/@data-config

Result: {"disable_embed":"0",... "vmap_url":"",... }

Next, you need the value of the vmap_url field. You can convert the data-config string into a Python dictionary with the following code (response is a Scrapy response object):

import json

def extract_next_link(self, response):
    # Extract the data-config attribute value as a string (None if not found)
    config_s = response.xpath("//div[@class='player-container']/@data-config").extract_first()
    if config_s:
        # Deserialize the JSON string into a Python dict
        config = json.loads(config_s)
        # vmap_url sits at the top level of the structure
        return config['vmap_url']


Details:
- extract_first() returns the structure as a string, or None if the XPath matched nothing.
- json.loads() deserializes the string containing the JSON structure into a Python object. It accepts Unicode strings directly, so converting to ASCII first is not needed (and encoding with "ignore" would silently drop any symbols that cannot be converted). When you need the opposite conversion, use json.dumps().
- config is now a Python dict. The vmap_url field is at the top level, so it is returned with a simple __getitem__ lookup (square brackets).
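
If you are not using Scrapy, the same extraction can be sketched with the standard library alone. This is only an illustration under a few assumptions: the class and function names are mine, the sample markup is made up, and the vmap_url value in it is a placeholder rather than a real Twitter URL. Unlike the XPath above, the sketch matches player-container as one of possibly several class names.

```python
import json
from html.parser import HTMLParser


class PlayerConfigParser(HTMLParser):
    """Collects the data-config attribute of the first player-container div."""

    def __init__(self):
        super().__init__()
        self.config = None

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.config is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        raw = attr_map.get("data-config")
        if "player-container" in classes and raw:
            # HTMLParser has already unescaped &quot; and similar entities.
            self.config = json.loads(raw)


def extract_vmap_url(page_source):
    """Return the vmap_url field from the player page, or None if absent."""
    parser = PlayerConfigParser()
    parser.feed(page_source)
    if parser.config is None:
        return None
    return parser.config.get("vmap_url")


# Hypothetical sample markup; the URL is a placeholder.
sample_page = (
    '<div class="player-container" '
    'data-config="{&quot;disable_embed&quot;:&quot;0&quot;,'
    '&quot;vmap_url&quot;:&quot;https://example.com/vmap.xml&quot;}"></div>'
)
print(extract_vmap_url(sample_page))
# prints https://example.com/vmap.xml
```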

Make an HTTP request to this URL and use the response body for the next step.

4) Now you have an XML file with the video URL inside a MediaFile tag.
You can extract it with the following xpath:
//MediaFile/text()

Result: https://snappytv-a.akamaihd.net/video/928000/420p420/2016-03-26T08-34-24.858Z--10.942.mp4?token=1468238049_c6bc333a97b0fd74b693bb6ce690a3c3

This is the link to the video you need.
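
For completeness, here is a minimal standard-library sketch of the same extraction without XPath support from Scrapy. It assumes the MediaFile element carries no XML namespace (as the XPath above suggests); the function name and the sample document are made up for illustration and do not reproduce the real file's structure.

```python
import xml.etree.ElementTree as ET


def extract_media_url(vmap_xml):
    """Return the text of the first non-empty MediaFile element, or None."""
    root = ET.fromstring(vmap_xml)
    # iter() walks the whole tree, similar to //MediaFile in XPath.
    for media_file in root.iter("MediaFile"):
        if media_file.text and media_file.text.strip():
            return media_file.text.strip()
    return None


# Hypothetical sample document with a placeholder URL.
sample_xml = """<?xml version="1.0"?>
<root>
  <MediaFile>
    https://example.com/video.mp4
  </MediaFile>
</root>"""
print(extract_media_url(sample_xml))
# prints https://example.com/video.mp4
```

The text is stripped because the URL may be surrounded by whitespace or line breaks inside the tag.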

In the articles "Downloading Files in Scrapy. MediaPipeline" and "Automating Twitter Video Scraping" I will show how to automate these steps with the Scrapy framework.