AI-powered web scraping platform that leverages Gemini AI to extract specific information from websites. It handles dynamic content and CAPTCHAs, and provides clean JSON output for easy integration.
CyberYozh - Reliable SMS activation, residential proxies, and mobile proxies for multi-accounting. We have gathered the best solutions for multi-accounting and automation in one place.
Thordata - Easy access to web data at scale, perfect for AI. A global network of 60M+ residential proxies with 99.7% availability, ensuring stable and reliable web data scraping to support AI, BI, and workflows. Free trial available! Start with the free trial to experience reliable proxy infrastructure. Exclusive: use code "THOR66" for 30% off your first purchase, or register with invitation code "0HSUJ23G".
WebCrawlAI is an intelligent web scraping platform designed to help developers, researchers, and businesses extract specific information from websites with ease. The platform combines advanced web scraping capabilities with AI-powered data extraction to handle complex websites, dynamic content, and CAPTCHAs.
The platform features an AI-powered extraction engine that uses Google's Gemini AI model to precisely parse and extract requested information based on natural language prompts. Users can simply provide a URL and describe what data they need (e.g., "Extract all product names and prices") and receive clean, structured JSON output.
Built with modern web technologies, WebCrawlAI emphasizes reliability through robust error handling, retry mechanisms, and comprehensive monitoring. The platform is designed for both technical and non-technical users, providing a user-friendly web interface alongside a powerful API for integration into existing workflows.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
- Python (v3.8 or higher)
- pip package manager
- Bright Data Scraping Browser account (for SBR_WEBDRIVER)
- Google Gemini API Key (for AI-powered extraction)
1. Clone the repository

   ```shell
   git clone https://github.com/ArjunCodess/WebCrawlAI.git
   cd WebCrawlAI
   ```

2. Install dependencies

   ```shell
   pip install -r requirements.txt
   ```

3. Set up environment variables

   Create a `.env` file and configure the required variables:

   ```
   SBR_WEBDRIVER="your_bright_data_scraping_browser_url"
   GEMINI_API_KEY="your_google_gemini_api_key"
   ```

4. Run the application

   ```shell
   python main.py
   ```
The application will be available at http://localhost:5000 (default Flask port).
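With the server running locally, the `/scrape-and-parse` endpoint (documented in the API reference below) can be exercised from Python. The following is a minimal stdlib-only client sketch; the base URL assumes the default port above, and the example URL and prompt are placeholders:

```python
"""Minimal client sketch for the /scrape-and-parse endpoint."""
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # default local dev server


def build_payload(url: str, parse_description: str) -> dict:
    # The endpoint expects exactly these two keys (see the API reference).
    return {"url": url, "parse_description": parse_description}


def scrape_and_parse(url: str, parse_description: str,
                     base_url: str = BASE_URL) -> dict:
    body = json.dumps(build_payload(url, parse_description)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/scrape-and-parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Network call: requires the server started with `python main.py`.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Once the server is up, `scrape_and_parse("https://www.example.com", "Extract all product names and prices")` returns the parsed JSON response.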
Currently, the project uses manual testing and user acceptance testing. Automated testing setup is planned for future releases.
1. Development Testing
   - Run the development server with `python main.py`
   - Test core features: web scraping, AI extraction, JSON output
   - Verify error handling and retry mechanisms

2. Integration Testing
   - Test with various website types (static, dynamic, with CAPTCHAs)
   - Verify AI extraction accuracy with different prompts
   - Test API endpoints and response formats

3. User Journey Testing
   - Complete the web interface workflow
   - Test API integration
   - Verify output format and accuracy
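Until automated tests land, the documented response shapes can at least be checked mechanically. Below is a sketch of contract checks against the success/error formats shown in the API reference; the fixtures are illustrative:

```python
"""Contract checks for the API's documented JSON response shapes."""


def is_success_response(payload: dict) -> bool:
    # Success responses carry `success: true` and a `result` object.
    return payload.get("success") is True and isinstance(payload.get("result"), dict)


def is_error_response(payload: dict) -> bool:
    # Error responses carry a single human-readable `error` string.
    return isinstance(payload.get("error"), str)


# Example fixtures mirroring the documented shapes:
ok = {"success": True,
      "result": {"products": [{"name": "Product A", "price": "$10"}]}}
bad = {"error": "An error occurred during scraping or parsing"}

assert is_success_response(ok) and not is_success_response(bad)
assert is_error_response(bad) and not is_error_response(ok)
```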
- Web Scraping
  - Handle static and dynamic websites
  - Bypass CAPTCHAs and anti-bot measures
  - Support for JavaScript-heavy sites
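A rough sketch of this scraping path: Selenium drives Bright Data's remote Scraping Browser (the `SBR_WEBDRIVER` endpoint from setup), and the returned HTML is reduced to visible text. The project itself uses BeautifulSoup for cleanup; the stdlib `html.parser` extractor below merely illustrates the idea:

```python
"""Sketch: fetch a dynamic page via the Scraping Browser, then strip it to text."""
import os
from html.parser import HTMLParser


def fetch_page_source(url: str) -> str:
    # Imports deferred so the text extractor works without Selenium installed.
    from selenium.webdriver import Remote, ChromeOptions
    from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection

    sbr = ChromiumRemoteConnection(os.environ["SBR_WEBDRIVER"], "goog", "chrome")
    with Remote(sbr, options=ChromeOptions()) as driver:
        driver.get(url)  # the managed browser renders JS and handles CAPTCHAs
        return driver.page_source


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def clean_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```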
- AI-Powered Extraction
  - Natural language prompts for data extraction
  - Precise parsing using Gemini AI
  - Structured JSON output
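The extraction step can be sketched with the `google-generativeai` SDK. The prompt wording and `gemini-pro` model name here are illustrative assumptions, not the project's exact internals:

```python
"""Sketch of AI extraction: page text + user prompt in, JSON out."""
import os


def build_prompt(page_text: str, parse_description: str) -> str:
    # Constrain the model to the user's request and to JSON-only output.
    return (
        f"From the following page content, {parse_description}. "
        "Respond with valid JSON only, no commentary.\n\n"
        f"{page_text}"
    )


def extract(page_text: str, parse_description: str) -> str:
    import google.generativeai as genai  # deferred; requires the SDK installed

    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel("gemini-pro")
    return model.generate_content(build_prompt(page_text, parse_description)).text
```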
- Web Interface
  - User-friendly interface for non-technical users
  - Real-time extraction results
  - Error handling and status updates
- API Integration
  - RESTful API for programmatic access
  - Clean JSON responses
  - Easy integration into existing workflows
- Monitoring and Analytics
  - Event tracking with GetAnalyzr
  - Performance monitoring
  - Usage analytics
1. Access the web interface at the deployed URL
2. Enter the target website URL
3. Provide a clear extraction prompt (e.g., "Extract all product names and prices")
4. Click "Extract Information"
5. Review the structured JSON output
The project is configured for deployment on Render with the following setup:

1. Render Deployment
   - Connect your repository to Render
   - Configure environment variables in the Render dashboard
   - Deploy automatically on pushes to the main branch

2. Required Environment Variables

   ```
   SBR_WEBDRIVER="your_bright_data_scraping_browser_url"
   GEMINI_API_KEY="your_google_gemini_api_key"
   FLASK_ENV="production"
   ```

3. Service Configuration
   - Configure as a Web Service on Render
   - Set build command: `pip install -r requirements.txt`
   - Set start command: `python main.py`

4. Monitoring and Error Tracking
   - GetAnalyzr integration for event tracking
   - Built-in error handling and logging
   - Performance monitoring capabilities
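Since Waitress appears in the tech stack, production serving can be sketched as below. This assumes `main.py` exposes a Flask `app` and that Render injects the listening port via a `PORT` env var; both are assumptions, not confirmed project internals:

```python
"""Sketch of serving the app in production with Waitress."""
import os


def resolve_port(default: int = 5000) -> int:
    # Many platforms (Render included) provide the port via the PORT env var;
    # fall back to the local dev default otherwise.
    return int(os.environ.get("PORT", default))


def run():
    from waitress import serve  # deferred; requires waitress installed
    from main import app        # assumes main.py defines the Flask app

    serve(app, host="0.0.0.0", port=resolve_port())
```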
- Bright Data Scraping Browser: For reliable web scraping with CAPTCHA handling
- Google Gemini AI: For intelligent data extraction and parsing
- GetAnalyzr: For usage analytics and monitoring
Endpoint: `/scrape-and-parse`

Method: `POST`

Request Body (JSON):

```json
{
  "url": "https://www.example.com",
  "parse_description": "Extract all product names and prices"
}
```

Response (JSON):

Success:

```json
{
  "success": true,
  "result": {
    "products": [
      { "name": "Product A", "price": "$10" },
      { "name": "Product B", "price": "$20" }
    ]
  }
}
```

Error:

```json
{
  "error": "An error occurred during scraping or parsing"
}
```

- Flask - Web Framework (v3.0.0)
- Python - Programming Language
- BeautifulSoup - HTML/XML Parser (v4.12.2)
- Selenium - Browser Automation (v4.16.0)
- lxml - Fast XML and HTML Processing
- html5lib - HTML Document Parser
- Bright Data Scraping Browser - Managed Browser Service
- Google Generative AI - Gemini AI Integration (v0.3.1)
- Vercel AI SDK - AI Integration Tools
- Tailwind CSS - Utility-First CSS Framework
- Axios - HTTP Client Library
- Marked - Markdown Parser
- Render - Deployment Platform
- python-dotenv - Environment Variables (v1.0.0)
- GetAnalyzr - Analytics and Event Tracking
- Waitress - WSGI Server
- ArjunCodess (Arjun Vijay Prakash) - Project development and maintenance
Note: This project embraces open-source values and transparency. We love open source because it keeps us accountable, fosters collaboration, and drives innovation. For collaboration opportunities or questions, please reach out through the appropriate channels.
- CyberYozh for providing reliable SMS activation and proxy solutions for multi-accounting and automation
- Thordata for powering our web scraping infrastructure with their global network of 60M+ residential proxies
- Google for providing the Gemini AI model that powers our intelligent extraction capabilities
- Bright Data for reliable scraping browser infrastructure
- Render for the excellent deployment platform
- Flask Team for the robust web framework
- Selenium for powerful browser automation capabilities
- Open Source Community for the countless libraries and tools that make modern web development possible
WebCrawlAI - Transforming web data into structured insights
Built with ❤️ for developers and data enthusiasts

