Career Tracker
GitHub entry. Published with customer permission.
This was a medium-sized project for the personal use of Jacques Sciammas, COO/CFO of sellingtoexecutives.com.
They needed to monitor employees' career changes, so an automated scraper was developed. The idea is to scrape profiles on one major social network to track changes in employees' companies and positions and store them in a database. The scraping process runs automatically every week, and there is also a button that lets the user request an immediate update. The database stores each employee's company and position, and when somebody gets a promotion (or changes company), the change is recorded in the database so that both the new and the old position are displayed.
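The change-tracking rule described above can be sketched in a few lines. This is only an illustration, not the project's actual schema: the EmployeeRecord class, its fields, and the apply_scrape function are hypothetical names, and the real project stores this data in a database rather than in memory.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EmployeeRecord:
    """One tracked person (hypothetical shape, not the real schema)."""
    name: str
    company: str
    position: str
    # Earlier (company, position) pairs, oldest first.
    history: List[Tuple[str, str]] = field(default_factory=list)


def apply_scrape(record: EmployeeRecord, company: str, position: str) -> bool:
    """Update a record from freshly scraped values.

    Returns True when the company or position changed. In that case the
    previous pair is kept in history, so both old and new can be displayed.
    """
    if (record.company, record.position) == (company, position):
        return False
    record.history.append((record.company, record.position))
    record.company, record.position = company, position
    return True
```

The boolean return value is a convenient hook for highlighting recent changes in the UI or collecting them for a report.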
There are two forms of user output: a web application and an e-mail report. Both are covered below.
Please see the screenshot of the frontend below. The color scheme is intended to mimic the scraped social network. The highlighted (orange) lines reflect recent career changes. The web application server is implemented with Python/Django.
The interesting part of the project was implementing the client-server interaction, given that the scraping process takes some time. Including the randomized delays that help avoid scraping detection, a full scan could take up to 20 minutes (for 300 records), so it definitely has to run asynchronously. When the user presses the "Update" button, the scraping task is started and an animated "Updating..." label is shown. The client then polls the server to check whether the update is complete.
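To put the timing in perspective: 20 minutes spread over 300 records is about 4 seconds per record. A minimal sketch of a scrape loop with randomized pauses might look like the following. The scrape_all function, the fetch callback, and the 2 to 6 second bounds are assumptions chosen for illustration (they average 4 seconds per pause); the actual delays used in the project are not stated here.

```python
import random
import time
from typing import Callable, Iterable, List


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 2.0,   # assumed lower bound, in seconds
    max_delay: float = 6.0,   # assumed upper bound, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch every URL in order, pausing a random interval between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # An irregular gap between requests looks less like a bot
            # than a fixed interval would.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

Passing sleep as a parameter keeps the loop easy to exercise without actually waiting; in normal use the default time.sleep applies.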
To run the scraping task asynchronously, Celery, a distributed task queue, was used. This requires a Celery worker process to run alongside the main web application server.
Celery requires a message broker to exchange messages, so RabbitMQ was chosen as a common solution. It took some time to configure the two to connect to each other. RabbitMQ installs as a service and starts automatically with the system.
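As a rough sketch of how the pieces fit together, a Celery application pointed at a local RabbitMQ broker, with one task to start and a helper to poll, could look like this. The module name, the task name update_profiles, the helper functions, the broker URL (RabbitMQ's default guest account on localhost), and the choice of result backend are all assumptions, not the project's actual configuration.

```python
# tasks.py (hypothetical module name)
from celery import Celery
from celery.result import AsyncResult

# Assumed setup: RabbitMQ on localhost with its default guest account.
# A result backend is needed for status checks. "rpc://" is the simplest
# choice, but it only returns results to the process that queued the task,
# so a multi-process web server would likely need a different backend.
app = Celery(
    "career_tracker",
    broker="amqp://guest:guest@localhost:5672//",
    backend="rpc://",
)


@app.task
def update_profiles() -> str:
    """Placeholder for the long-running scrape."""
    return "done"


def start_update() -> str:
    """Queue the scrape and return the task id for the client to poll with."""
    return update_profiles.delay().id


def is_update_complete(task_id: str) -> bool:
    """Report whether the queued task has finished (successfully or not)."""
    return AsyncResult(task_id, app=app).ready()
```

Under these assumptions the worker would be started separately with "celery -A tasks worker --loglevel=info", the "Update" button would call start_update, and each poll from the client would call is_update_complete with the returned id.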
The second way of reporting is much more minimal. The update process is run as a console application by the Windows Task Scheduler once a week. Only those employees who changed their position or company are then reported to the user by e-mail.
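A minimal sketch of building such a report with Python's standard library follows. The build_report function, the shape of the Change tuple, the subject line, and the decision to skip the e-mail entirely when nothing changed are assumptions made for illustration.

```python
from email.message import EmailMessage
from typing import List, Optional, Tuple

# (name, old role, new role); a hypothetical shape for one detected change.
Change = Tuple[str, str, str]


def build_report(
    changes: List[Change], sender: str, recipient: str
) -> Optional[EmailMessage]:
    """Return a plain-text report, or None when there is nothing to report."""
    if not changes:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"Career Tracker: {len(changes)} change(s) this week"
    msg["From"] = sender
    msg["To"] = recipient
    # One line per person, showing the old and the new role side by side.
    lines = [f"{name}: {old} -> {new}" for name, old, new in changes]
    msg.set_content("\n".join(lines))
    return msg
```

The resulting message can be handed to smtplib.SMTP(...).send_message(msg); the server address and credentials depend on the deployment and are left out here.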
The same code had to be reused for both the console application and the Django server. The problem is that the Django server uses its own environment to perform database reads and writes, and the console command needs the same environment to perform the same database updates. The solution is to create a custom Django command, which sets up that environment, performs the task and exits.
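For illustration, a custom Django management command has roughly the following shape. The app name tracker, the command file name, and the shared run_update function are hypothetical; only BaseCommand and its handle method come from Django itself.

```python
# tracker/management/commands/update_profiles.py (hypothetical path)
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Run the scrape once and report the changed profiles."

    def handle(self, *args, **options):
        # Hypothetical function shared with the web application; importing
        # it here means it is loaded after Django has set up its settings
        # and database access.
        from tracker.scraper import run_update

        changed = run_update()
        self.stdout.write(f"{len(changed)} profile(s) changed")
```

With a file like this in place, the scheduled job would simply run "python manage.py update_profiles", and the command gets the same settings and database layer as the web application.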