This repository holds the code for scrapers built under the project "Scrape the Planet"
- Framework used for scraping: Scrapy
- Language used for scraping: Python 3.x
Minutes of the meeting: http://bit.ly/scrapeThePlanet
Modify the hosts file at scrapeNews/install/ansible/hosts and run `ansible-playbook install.yaml -i hosts` to automatically deploy the complete application on a server.
To customise other variables such as username, deploy_base, etc., edit the scrapeNews/install/ansible/group_vars/all file.
Clone the repository (or download it). Then, follow the installation steps to run the spiders.
```
ROOT
|__ deploy
|_____ env.sh
|_____ web
|_______ web_app
|_______ scrapeNews
|_________ db.py
|_________ settings.py
|_____ logs
|_____ start
|_____ VENV_NAME
|__ scrape
```
path: `ROOT/deploy/VENV_NAME`

Create the virtual environment:
```
python3 -m venv VENV_NAME
```
Activate it:
- Windows: `VENV_NAME/Scripts/activate`
- Linux: `source VENV_NAME/bin/activate`

Navigate to the repository and install the requirements:
```
pip install -r requirements.txt
```
- Requirements (for scraping):
  - scrapy
  - requests
  - python-dateutil
- Requirements (for database):
  - psycopg2
- Requirements (for Flask application):
  - flask
  - gunicorn
- Requirements (for deploying and scheduling spiders; see the sketch after this list):
  - scrapyd
  - git+https://github.com/scrapy/scrapyd-client
  - schedule
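The last group pairs scrapyd (which exposes an HTTP API for deploying and running spiders) with the `schedule` library. The repository's own scheduling code is not shown here; the following is only a minimal sketch of how the two can be combined, assuming scrapyd's default port 6800 and a Scrapy project deployed under the name `scrapeNews`:

```python
# Minimal sketch: queue a spider on scrapyd at a fixed interval.
# Assumes scrapyd runs on its default port (6800) and the project was
# deployed (e.g. via scrapyd-client) under the name "scrapeNews".
import time

import requests
import schedule

def run_spider(spider_name):
    # scrapyd's schedule.json endpoint queues a crawl job for the given spider.
    response = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": "scrapeNews", "spider": spider_name},
    )
    print(response.json())

# Example: crawl inshorts every 6 hours.
schedule.every(6).hours.do(run_spider, spider_name="inshorts")

while True:
    schedule.run_pending()
    time.sleep(60)
```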
Install Tor and Privoxy:
```
sudo apt-get install tor
sudo apt-get install privoxy
```
Add the following lines at the end of /etc/privoxy/config:
```
forward-socks5 / 127.0.0.1:9050 .
forward-socks4a / 127.0.0.1:9050 .
forward-socks5t / 127.0.0.1:9050 .
```
Start both services:
```
sudo service tor start
sudo service privoxy start
```
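With Tor and Privoxy running, HTTP traffic can be routed through Privoxy, which forwards to Tor's SOCKS port. How the spiders actually use the proxy is project-specific; below is only a generic sketch using Scrapy's per-request `proxy` meta key, assuming Privoxy's default listen address 127.0.0.1:8118:

```python
# Minimal sketch: send Scrapy requests through the local Privoxy proxy.
# Privoxy listens on 127.0.0.1:8118 by default; adjust if you changed it.
# This is an illustrative middleware, not the project's actual configuration.
class PrivoxyProxyMiddleware:
    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware picks up this meta key.
        request.meta["proxy"] = "http://127.0.0.1:8118"

# Enable it in settings.py (module path and priority are placeholders):
# DOWNLOADER_MIDDLEWARES = {
#     "scrapeNews.middlewares.PrivoxyProxyMiddleware": 350,
# }
```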
Install PostgreSQL:
```
sudo apt-get install postgresql postgresql-contrib
```
Note: your USERNAME and PASSWORD must contain only lowercase characters.
Switch to the postgres user and create a database role:
```
sudo -i -u postgres
createuser YOUR_ROLE_NAME/YOUR_USERNAME --interactive --pwprompt
```

path: `ROOT/deploy/env.sh`
```
# Set The Environment Variables
export SCRAPER_DB_HOST="ENTER_VALUE_HERE"
export SCRAPER_DB_USER="ENTER_VALUE_HERE"
export SCRAPER_DB_PASS="ENTER_VALUE_HERE"
export SCRAPER_DB_NAME="ENTER_VALUE_HERE"
export SCRAPER_DB_TABLE_NEWS="ENTER_VALUE_HERE"
export SCRAPER_DB_TABLE_SITE="ENTER_VALUE_HERE"
export SCRAPER_DB_TABLE_LOG="ENTER_VALUE_HERE"
export FLASK_APP="ROOT/deploy/web/web_app/server.py"
```
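The database code (db.py) presumably reads these variables from the environment. A minimal sketch of that pattern with psycopg2 is shown below; only the variable names come from env.sh, everything else is illustrative:

```python
# Minimal sketch: open a PostgreSQL connection from the env.sh variables.
# The project's real connection handling lives in db.py; this only shows the pattern.
import os

import psycopg2

def get_connection():
    return psycopg2.connect(
        host=os.environ["SCRAPER_DB_HOST"],
        user=os.environ["SCRAPER_DB_USER"],
        password=os.environ["SCRAPER_DB_PASS"],
        dbname=os.environ["SCRAPER_DB_NAME"],
    )

if __name__ == "__main__":
    # Quick connectivity check; run `source ROOT/deploy/env.sh` first.
    with get_connection() as conn, conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone())
```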
Copy the web application files into the deploy tree:
- Copy ROOT/scrape/scrapeNews/web_app/* to ROOT/deploy/web/web_app/*
- Copy ROOT/scrape/scrapeNews/scrapeNews/settings.py to ROOT/deploy/web/scrapeNews/
- Copy ROOT/scrape/scrapeNews/scrapeNews/db.py to ROOT/deploy/web/scrapeNews/
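FLASK_APP points at web_app/server.py, which ships with the repository, so nothing needs to be written here; the sketch below only illustrates the shape gunicorn expects, a module exposing a Flask `app` object (the route and names are hypothetical):

```python
# Minimal sketch of a gunicorn-compatible Flask entry point.
# The project's real server.py is copied from web_app/; names here are illustrative.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/")
def index():
    # A real view would read scraped articles through db.py.
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    # Local debugging only; in deployment gunicorn serves `app` behind nginx.
    app.run(host="127.0.0.1", port=5000)
```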
path: `ROOT/deploy/`

Install nginx:
```
sudo apt-get install nginx
```
Activate the virtual environment and install gunicorn:
```
source ROOT/deploy/VENV_NAME/bin/activate
pip install gunicorn
```
Replace the variables in ROOT/scrape/scrapeNews/install/standard/nginx-site/scraper, copy it to /etc/nginx/sites-available/, and create a symbolic link:
```
sudo ln -s /etc/nginx/sites-available/scraper /etc/nginx/sites-enabled/scraper
sudo service nginx restart
```
Copy the startup script from ROOT/scrape/scrapeNews/install/standard/start to ROOT/deploy/start.
Replace the variables in ROOT/scrape/scrapeNews/install/standard/systemd/scraper.service, copy it to /etc/systemd/system/, and enable the service:
```
sudo systemctl daemon-reload
sudo systemctl restart scraper
sudo systemctl enable scraper
```
Note: navigate to the folder containing scrapy.cfg.

path: `ROOT/scrape/scrapeNews`

```
scrapy crawl SPIDER_NAME
```

Available spiders:
1. indianExpressTech
2. indiaTv
3. timeTech
4. ndtv
5. inshorts
6. zee
7. News18Spider
8. moneyControl
9. oneindia
10. oneindiaHindi
11. firstpostHindi
12. firstpostSports
13. newsx
14. hindustan
15. asianage
16. timeNews
17. newsNation
To set the number of pages to be scraped, use `-a pages=X` (X = number of pages to scrape); see the sketch after the lists below.
Applicable for:
- indianExpressTech
- indiaTv
- timeTech
- moneyControl
- oneindia
- oneindiaHindi
- firstpostHindi
- firstpostSports
- newsx
- asianage
- ndtv
- timeNews
To set the number of pages to skip, use `-a offset=X` (X = number of pages to skip); see the sketch after this list.
Applicable for:
- indianExpressTech
- indiaTv
- timeTech
- moneyControl
- oneindia
- oneindiaHindi
- firstpostHindi
- firstpostSports
- newsx
- asianage
- timeNews
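Scrapy passes `-a` arguments to a spider's constructor as keyword arguments, which is how `pages` and `offset` reach the spiders listed above. The class below is a generic illustration of that mechanism, not one of the project's actual spiders:

```python
# Minimal sketch: how `-a pages=X -a offset=Y` reach a spider.
# Run with: scrapy crawl exampleNews -a pages=3 -a offset=2
# The spider name, URL, and parsing logic are illustrative only.
import scrapy

class ExampleNewsSpider(scrapy.Spider):
    name = "exampleNews"

    def __init__(self, pages=1, offset=0, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a values arrive as strings, so cast them before use.
        self.pages = int(pages)
        self.offset = int(offset)

    def start_requests(self):
        # Request one listing page per requested page, skipping `offset` pages.
        for page in range(self.offset + 1, self.offset + self.pages + 1):
            yield scrapy.Request(f"https://example.com/news?page={page}")

    def parse(self, response):
        yield {"url": response.url}
```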