SepSpider: Spider for UCAS SEP platform
SepSpider is a spider based on scrapy and selenium for automatically crawling and updating UCAS class resources. It's also suitable for server deployment.
- Linux (Ubuntu recommended)
- Python3
- scrapy
- pillow
- selenium
- Firefox
- Set up account info at
sep_spider/custom_setting.py.
# sep user info
SEP_USER = 'your sep account (email)'
SEP_PASSWD = 'your password'
# The spider use Yundama for Captcha recognition.
# If you don't have one, create an User account (NOT a Developer one) at:
# http://www.yundama.com/
# and add some credits to it.
# yundama user info
YDM_INFO = {
'username': 'yundama username',
'password': 'yundama password',
'appid': 8741,
'appkey': 'c2f7b93e58d4721079da3822c7aad5d4'
}
# where to store you file
CUSTOM_FILES_STORE = 'sep/'
# touch reload path used by listener
RELOAD_PATH = './reload'Yundama account is needed for Captcha recognition. Create an User (NOT Developer) account at http://www.yundama.com/ and add a little credits to it. 5 RMB is enough for a year's use.
- run spider
cd sep_spider
scrapy crawl sep_spiderThe default location to store files is ./sep. It can be customized in sep_spider/custom_setting.py.
For server deployment, you can run listener.py to listen to the changes of reload file as the signal of recrawling. Recrawling will only download non-existing files. It won't modify existing files.
- Set up your environment and account info (see above)
- Run
listener.py
python listener.py- Everytime you modify the
reloadfile, the spider will recrawl the resources. The location of reload file is customizable insep_spider/custom_setting.py.
touch reload