A comprehensive collection of web scraping scripts for extracting data from popular websites. This project demonstrates various web scraping techniques using Python and provides ready-to-use scripts for data extraction.

- Multiple Website Support: Scrape data from 10+ popular websites
- CSV Output: All scrapers export data in CSV format for easy analysis
- Easy to Use: Simple Python scripts with clear documentation
- Educational: Perfect for learning web scraping techniques
- Open Source: Contribute and improve the collection
Install the dependencies:

```bash
pip install requests beautifulsoup4 lxml
```

Clone the repository and run any scraper:

```bash
# Clone the repository
git clone https://github.com/amolsr/web-scrapping.git
cd web-scrapping

# Run any scraper
python scrapers/ecommerce/flipkart.py

# Check the output
ls output/
```
Sample output from the IMDB scraper (`output/imdb.csv`):

```csv
Rank,Name,Year,Rating,Link,Director
1,The Shawshank Redemption,1994,9.2,https://www.imdb.com/title/tt0111161/,Frank Darabont
2,The Godfather,1972,9.2,https://www.imdb.com/title/tt0068646/,Francis Ford Coppola
```

Sample output from the Flipkart scraper (`output/flipkart_latest_smartphones.csv`); note that the price field is quoted because it contains a comma:

```csv
Mobile Name,Ratings,Pricing,Description
Nokia 8.1,4.3,"₹15,999",6GB RAM | 128GB Storage
Nokia 6.1 Plus,4.2,"₹12,999",4GB RAM | 64GB Storage
```
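Because every scraper writes plain CSV, the results can be inspected with nothing more than Python's built-in `csv` module. A small sketch, assuming the `output/imdb.csv` file shown above:

```python
import csv

# Print the top movies from the generated CSV
with open("output/imdb.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["Rank"], row["Name"], row["Rating"])
```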
To run a specific scraper:

```bash
python scrapers/content/imdb.py
```

Each script will automatically:

1. Fetch data from the website
2. Parse the HTML content
3. Extract the relevant information
4. Save the results to a CSV file in the `output/` directory

Each script can also be easily modified to (see the sketch after this list):
- Change the target URL
- Extract different data fields
- Modify the output format
- Add error handling
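For example, a minimal end-to-end scraper in the same style might look like the sketch below. It targets the quotes.toscrape.com practice site; the selectors and output filename are illustrative and not copied from the repo's own quotes_toscrape.py:

```python
import csv
import os

import requests
from bs4 import BeautifulSoup

URL = "https://quotes.toscrape.com/"  # change the target URL here

try:
    response = requests.get(URL, timeout=10)
    response.raise_for_status()  # error handling: fail loudly on HTTP errors
except requests.RequestException as exc:
    raise SystemExit(f"Request failed: {exc}")

soup = BeautifulSoup(response.text, "lxml")

# Extract different data fields by changing these selectors
rows = []
for quote in soup.select("div.quote"):
    rows.append({
        "text": quote.select_one("span.text").get_text(strip=True),
        "author": quote.select_one("small.author").get_text(strip=True),
    })

# Modify the output format here (CSV by default)
os.makedirs("output", exist_ok=True)
with open("output/quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "author"])
    writer.writeheader()
    writer.writerows(rows)
```

Changing the URL and the two `select` calls is usually all it takes to point the same skeleton at another site.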
Project structure:

```
web-scrapping/
├── scrapers/                        # All scraper scripts organized by category
│   ├── ecommerce/                   # E-commerce website scrapers
│   │   ├── flipkart.py              # Flipkart smartphone scraper
│   │   ├── amazon.py                # Amazon product scraper
│   │   └── olx.py                   # OLX listings scraper
│   ├── job_boards/                  # Job board scrapers
│   │   ├── indeed.py                # Indeed job listings
│   │   ├── naukri_jobs.py           # Naukri job listings
│   │   ├── apnajob.py               # ApnaJob listings
│   │   ├── jobhai.py                # JobHai listings
│   │   ├── welcome_to_the_jungle.py # Welcome to the Jungle jobs
│   │   └── craigslist_jobs.py       # Craigslist jobs
│   ├── educational/                 # Educational platform scrapers
│   │   ├── udemy.py                 # Udemy course scraper
│   │   ├── sanfoundry.py            # Sanfoundry educational content
│   │   ├── college_notice_scraper.py # College notices scraper
│   │   ├── javaguide.py             # Java Guide content
│   │   └── indiabix_networking.py   # IndiaBix networking Q&A
│   ├── social_media/                # Social media and developer platforms
│   │   ├── youtube.py               # YouTube video scraper
│   │   ├── youtube_links.py         # YouTube links extractor
│   │   ├── reddit.py                # Reddit posts scraper
│   │   ├── hackernews.py            # Hacker News posts
│   │   ├── stack_overflow.py        # Stack Overflow questions
│   │   └── github.py                # GitHub repository scraper
│   ├── content/                     # Content and media scrapers
│   │   ├── imdb.py                  # IMDB top movies scraper
│   │   ├── books_toscrape.py        # Books.toscrape.com scraper
│   │   ├── quotes_toscrape.py       # Quotes to Scrape scraper
│   │   ├── wikipedia.py             # Wikipedia table scraper
│   │   └── openlibrary_books.py     # Open Library books
│   ├── misc/                        # Miscellaneous scrapers
│   │   ├── coinmarketcap.py         # Cryptocurrency market data
│   │   ├── weather.py               # Weather information scraper
│   │   ├── craigslist_housing.py    # Craigslist housing
│   │   └── syntaxminds.py           # SyntaxMinds content
│   └── utils/                       # Utility functions
│       └── __init__.py              # Helper functions for scrapers
├── output/                          # Generated CSV files
│   ├── flipkart_latest_smartphones.csv
│   ├── imdb.csv
│   ├── github.csv
│   └── ...
├── main.py                          # Main entry point
└── README.md                        # This file
```
The project depends on:

- requests: HTTP library for making web requests
- beautifulsoup4: HTML/XML parsing library
- lxml: Fast XML and HTML parser, used as BeautifulSoup's backend
- csv: Python's built-in CSV module for data export (no installation needed)
We welcome contributions! Here's how you can help:
- Fork the repository
- Create a new scraper or improve existing ones
- Add proper documentation and comments
- Test your changes
- Submit a pull request
Ideas for new contributions:

- Add new website scrapers
- Improve error handling
- Add data validation
- Create a web interface
- Add support for other output formats such as JSON and XML (see the sketch after this list)
- Implement rate limiting and respect robots.txt
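For the JSON output idea above, a minimal sketch (assuming `rows` is a list of dicts like the one built in the scraper sketch earlier):

```python
import json

# rows as produced by a scraper, e.g. the quotes example above
rows = [{"text": "To be, or not to be", "author": "William Shakespeare"}]

with open("output/quotes.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False, indent=2)
```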
When scraping, keep these guidelines in mind:

- Respect robots.txt: Always check the website's robots.txt file before scraping (see the sketch after this list)
- Rate Limiting: Add delays between requests to avoid overloading servers
- Terms of Service: Ensure you comply with each website's terms
- Data Usage: Use scraped data responsibly and ethically
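A minimal sketch of the first two practices, using the standard library's `urllib.robotparser` and a fixed delay (the site and paths are placeholders):

```python
import time
from urllib import robotparser

import requests

BASE = "https://quotes.toscrape.com"  # placeholder target

# Respect robots.txt: fetch and parse the site's robots file
rp = robotparser.RobotFileParser(BASE + "/robots.txt")
rp.read()

for path in ["/page/1/", "/page/2/"]:
    url = BASE + path
    if not rp.can_fetch("*", url):
        print("Disallowed by robots.txt:", url)
        continue
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(2)  # rate limiting: pause between requests
```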
This project is open source and available under the MIT License.
Thanks to:

- Beautiful Soup for HTML parsing
- the Requests library for HTTP handling
- all contributors who help improve this collection
If you have questions or need help:
- Open an issue on GitHub
- Check the code comments for implementation details
- Review the output files for expected data format
Happy Scraping! 🕷️✨