Mirror0
GitHub entry. Published with customer permission.

The first development phase of this project lasted from May to June 2016. I worked with one Test Engineer, who ran the project in a remote data center and set up the server (Ubuntu) according to my installation instructions. It was a subcontract project with Ad Hippo: building a system that analyzes brand appearance in different media sources, and thus the impact of an advertising campaign.

The customer needed a system collecting data from a set of large sports news sites, including article texts, images and videos. The system includes a custom fingerprinting mechanism, so that when run on a daily basis it downloads only newly added pages. It was implemented with the Scrapy platform and Python 2.7.

I developed a generic spider, which starts from each predefined website category, extracts links in batches and turns pages when needed. After a batch of articles is scraped completely, the spider proceeds with the next one. The Scrapy framework itself would take an unlimited number of HTTP requests into processing, so an additional batching algorithm was implemented to ease progress control and user output.
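A rough sketch of this batching idea in a Scrapy spider might look like the following. It is an illustration only, not the project code; the selectors, the `BATCH_SIZE` value and the callback names are assumptions:

```python
import scrapy

BATCH_SIZE = 20  # assumed batch size; the real value is project-specific

class GenericArticleSpider(scrapy.Spider):
    name = "generic_articles"
    start_urls = ["http://example.com/category/news"]  # placeholder category URL

    def parse(self, response):
        # Collect all article links from the category page, then schedule
        # only the first batch; the rest wait until that batch completes.
        links = response.css("a.article::attr(href)").extract()
        self.pending = links[BATCH_SIZE:]
        batch = links[:BATCH_SIZE]
        self.in_progress = len(batch)
        for url in batch:
            yield scrapy.Request(response.urljoin(url), callback=self.parse_article)

    def parse_article(self, response):
        self.in_progress -= 1
        yield {"url": response.url,
               "text": "".join(response.css("article ::text").extract())}
        # Once the whole batch is done, schedule the next one.
        if self.in_progress == 0 and self.pending:
            next_batch = self.pending[:BATCH_SIZE]
            self.pending = self.pending[BATCH_SIZE:]
            self.in_progress = len(next_batch)
            for url in next_batch:
                yield scrapy.Request(response.urljoin(url), callback=self.parse_article)
```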

The generic spider was then easily adjusted to each website. In addition, a set of pipelines was developed for each resource type. Media (video) extraction is the trickiest part of scraping each site. A pipeline was developed for each video type, so that a proper subset of pipelines covers each of the project websites.
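In Scrapy, such a per-site subset of pipelines can be selected, for instance, through the spider's `custom_settings`; the module path and class names below are assumptions made for illustration:

```python
import scrapy

class Site1Spider(scrapy.Spider):
    name = "site1"
    # Only the pipelines this site actually needs are enabled;
    # the numbers define the processing order.
    custom_settings = {
        "ITEM_PIPELINES": {
            "mirror0.pipelines.TextPipeline": 100,
            "mirror0.pipelines.ImagePipeline": 200,
            "mirror0.pipelines.YoutubeDlVideoPipeline": 300,
        }
    }
```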

For Site1, the youtube-dl library is used to download videos. It is a complex utility which can be run in a separate process, generating its own logs. The corresponding pipeline runs a limited number of youtube-dl processes, each extracting and downloading videos from one page. Before running the next article batch, the spider waits for all videos of the current batch to be downloaded. Each video process generates a separate log file, useful for diagnostics. The Python subprocess module is utilized, and this multiprocess pipeline integrates easily into Scrapy's single-threaded asynchronous environment.
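A minimal sketch of such a pipeline is shown below; the item fields (`video_urls`, `directory`) and the concurrency limit are assumptions chosen for illustration, not the project's actual interface:

```python
import subprocess

MAX_PROCESSES = 4  # assumed limit on concurrent youtube-dl processes

class YoutubeDlVideoPipeline(object):
    """Downloads each article's videos in separate youtube-dl processes."""

    def __init__(self):
        self.processes = []

    def _active(self):
        return [p for p in self.processes if p.poll() is None]

    def process_item(self, item, spider):
        for url in item.get("video_urls", []):
            # Throttle: wait for a free slot before spawning another process.
            while len(self._active()) >= MAX_PROCESSES:
                self._active()[0].wait()
            # Each process writes its own log file next to the downloaded video.
            with open(item["directory"] + "/youtube-dl.log", "a") as log:
                self.processes.append(subprocess.Popen(
                    ["youtube-dl", "-o",
                     item["directory"] + "/%(title)s.%(ext)s", url],
                    stdout=log, stderr=subprocess.STDOUT))
        return item

    def close_spider(self, spider):
        # Wait for the remaining downloads before the spider shuts down.
        for proc in self.processes:
            proc.wait()
```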



If the same video is present in two or more articles, a symbolic link is put into the results directories, pointing to the first occurrence of the video.
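A minimal sketch of this deduplication idea (the mapping and helper names are made up for illustration):

```python
import os

seen_videos = {}  # maps video URL -> path of its first downloaded copy

def download_video(url, path):
    # Placeholder for the real download step (e.g. a youtube-dl call).
    open(path, "wb").close()

def save_or_link(video_url, target_path):
    """Save the video on first sight; later occurrences get a symbolic link."""
    if video_url in seen_videos:
        os.symlink(seen_videos[video_url], target_path)
    else:
        download_video(video_url, target_path)
        seen_videos[video_url] = target_path
```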

Site2 contains videos of a few types. One of them is Twitter video, which required a set of consecutive requests and is described in the Scraping Twitter Videos article.

Another video type couldn't be retrieved by just analyzing the XML/JSON data of HTTP responses. It required complex JavaScript code to run, so the Selenium platform was chosen to fully emulate browser activity, including JavaScript. ChromeDriver is used in conjunction with Selenium to drive a Chromium browser. Certain options, although not well documented, were discovered to make ChromeDriver behave the right way. The mp4 download link can only be retrieved in HTML5 fallback mode, which is enabled by fully disabling the Flash player at all levels (Shockwave and Adobe plugins plus the bundled player). The driver is run without a window, with sound and other resources disabled to save bandwidth and suppress unneeded output. Selenium is a heavy-weight solution, so it is run only for the limited number of videos which couldn't be extracted by other means.
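The general shape of such a ChromeDriver setup is sketched below. The specific command-line switches and preference keys are assumptions, not the project's exact (and, as noted, poorly documented) option set; running without a window would additionally require something like a virtual X display:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--disable-plugins")   # keep Flash (Shockwave/Adobe) from loading
options.add_argument("--mute-audio")        # suppress sound output
options.add_experimental_option("prefs", {
    # Block images to save bandwidth; 2 means "do not load".
    "profile.managed_default_content_settings.images": 2,
})

driver = webdriver.Chrome(chrome_options=options)
try:
    driver.get("http://example.com/some-article")  # placeholder URL
    # With Flash disabled, the page falls back to an HTML5 <video> element,
    # whose src attribute is the direct mp4 link.
    mp4_url = driver.find_element_by_tag_name("video").get_attribute("src")
finally:
    driver.quit()
```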

Scraping results are saved in a user-friendly directory structure. The base directory is set up in an .ini file. For example, if the user puts the $HOME/.mirror0/DATA path into the .ini file, the resources for an article from Yahoo Sports would be saved to "~/.mirror0/DATA/au.sports.yahoo.com/Eagles premiership winners to be honoured", where the last part is the article name. Each directory contains the article text, pictures numbered in order of appearance on the page, videos, and a metadata file.
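A sketch of how the base path could be read from the .ini file and turned into a per-article directory; the config file location and the section/option names are hypothetical:

```python
# -*- coding: utf-8 -*-
import os
import ConfigParser  # Python 2.7, matching the project environment

# Hypothetical .ini layout -- the real section/option names may differ:
# [storage]
# data_dir = $HOME/.mirror0/DATA
config = ConfigParser.ConfigParser()
config.read(os.path.expanduser("~/.mirror0/mirror0.ini"))
data_dir = os.path.expandvars(config.get("storage", "data_dir"))

def article_dir(site, title):
    """Build the per-article results directory, e.g.
    ~/.mirror0/DATA/au.sports.yahoo.com/Eagles premiership winners to be honoured"""
    path = os.path.join(data_dir, site, title)
    if not os.path.exists(path):
        os.makedirs(path)
    return path
```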


A fingerprinting system is implemented to avoid re-processing articles scraped on previous runs. To make a fingerprint, the CRC64 sum of the URL string is computed by means of the crcmod library. A collision (false match) of two random URLs U1 and U2 has the probability
$$P(crc64(U1) = crc64(U2)) = \frac{1}{2^{64}},$$
assuming the CRC values are uniformly distributed and URLs are 1-2083 characters long. This was considered sufficient, given there are only thousands of articles on each website. The small size of each fingerprint keeps RAM/CPU consumption minimal for comparisons and database load/save operations.
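With crcmod, such a fingerprint can be computed roughly as follows; the choice of the predefined "crc-64" variant and the surrounding bookkeeping are assumptions:

```python
import crcmod.predefined

# "crc-64" is one of crcmod's predefined polynomials; the project may well
# use a different variant (e.g. crc-64-jones).
crc64 = crcmod.predefined.mkCrcFun("crc-64")

def fingerprint(url):
    # An 8-byte integer fingerprint of the article URL.
    return crc64(url.encode("utf-8"))

seen = set()  # would be loaded from / saved to the fingerprint database

url = "http://au.sports.yahoo.com/some-article"
if fingerprint(url) not in seen:
    seen.add(fingerprint(url))
    # ... scrape the article ...
```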



An additional log displays a per-article report. Each module extracting a separate resource type (pipeline) reports its start and finish, so it is easy for the developer to see which components were found on a page. This format also clearly shows if some component failed to extract.
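The reporting convention could look roughly like this; the logger name and message format are invented for illustration:

```python
import logging

report_log = logging.getLogger("mirror0.report")  # hypothetical logger name

class ImagePipeline(object):
    """Example of the per-article start/finish reporting convention."""

    def process_item(self, item, spider):
        report_log.info("IMAGES: start  %s", item["url"])
        try:
            # ... extract and save images here ...
            report_log.info("IMAGES: finish %s", item["url"])
        except Exception:
            report_log.exception("IMAGES: FAILED %s", item["url"])
            raise
        return item
```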