
WebCrawlAI - AI-Powered Web Scraping Platform



AI-powered web scraping platform that leverages Gemini AI to extract specific information from websites. It handles dynamic content and CAPTCHAs, and returns clean JSON output for easy integration.


πŸ† Sponsors

CyberYozh

CyberYozh - Reliable SMS activation, residential proxies, and mobile proxies for multi-accounting.

We have gathered the best solutions for multi-accounting and automation in one place.

Thordata

Thordata - Easy access to web data at scale, perfect for AI.

A global network of 60M+ residential proxies with 99.7% availability, ensuring stable and reliable web data scraping to support AI, BI, and workflows.

🎁 Free Trial Available! Start with our free trial to experience reliable proxy infrastructure.

💰 Exclusive: Use code "THOR66" for 30% off your first purchase!
🔗 Register with invitation code "0HSUJ23G"

πŸ“ Table of Contents

🧐 About

WebCrawlAI is an intelligent web scraping platform designed to help developers, researchers, and businesses extract specific information from websites with ease. The platform combines advanced web scraping capabilities with AI-powered data extraction to handle complex websites, dynamic content, and CAPTCHAs.

The platform features an AI-powered extraction engine that uses Google's Gemini AI model to precisely parse and extract requested information based on natural language prompts. Users can simply provide a URL and describe what data they need (e.g., "Extract all product names and prices") and receive clean, structured JSON output.

Built with modern web technologies, WebCrawlAI emphasizes reliability through robust error handling, retry mechanisms, and comprehensive monitoring. The platform is designed for both technical and non-technical users, providing a user-friendly web interface alongside a powerful API for integration into existing workflows.
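As a sketch of the extraction flow described above, the scraped page text and the user's natural-language request are combined into a single prompt that asks Gemini for JSON only. The helper below is hypothetical and only illustrates the idea; the project's actual prompt wording and internals may differ.

```python
def build_extraction_prompt(page_text: str, parse_description: str) -> str:
    """Combine scraped page text with the user's request into one AI prompt.

    Hypothetical helper illustrating the approach, not the project's
    actual prompt template.
    """
    return (
        "You are a data-extraction assistant. From the page content below, "
        f"{parse_description}. Respond with valid JSON only, no commentary.\n\n"
        f"Page content:\n{page_text}"
    )
```

Constraining the model to "valid JSON only" is what makes the output directly usable as structured data downstream.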

🏁 Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

  • Python (v3.8 or higher)
  • pip package manager
  • Bright Data Scraping Browser account (for SBR_WEBDRIVER)
  • Google Gemini API Key (for AI-powered extraction)

Installing

  1. Clone the repository

    git clone https://github.com/ArjunCodess/WebCrawlAI.git
    cd WebCrawlAI
  2. Install dependencies

    pip install -r requirements.txt
  3. Set up environment variables. Create a .env file with the required variables:

    SBR_WEBDRIVER="your_bright_data_scraping_browser_url"
    GEMINI_API_KEY="your_google_gemini_api_key"
  4. Run the application

    python main.py

The application will be available at http://localhost:5000 (default Flask port).
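Before starting the server, it can help to fail fast when either of the two required variables is missing. A minimal, stdlib-only sketch (the variable names come from the .env example above; the helper itself is hypothetical, not part of main.py):

```python
import os

REQUIRED_VARS = ("SBR_WEBDRIVER", "GEMINI_API_KEY")


def check_settings(env=os.environ) -> dict:
    """Return the required settings, raising early if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing environment variables: {', '.join(missing)}"
        )
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing at startup with a clear message beats a confusing error deep inside a scraping run.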

🔧 Running the tests

Currently, the project uses manual testing and user acceptance testing. Automated testing setup is planned for future releases.

Manual Testing

  1. Development Testing

    • Run the development server with python main.py
    • Test core features: web scraping, AI extraction, JSON output
    • Verify error handling and retry mechanisms
  2. Integration Testing

    • Test with various website types (static, dynamic, with CAPTCHAs)
    • Verify AI extraction accuracy with different prompts
    • Test API endpoints and response formats
  3. User Journey Testing

    • Complete web interface workflow
    • Test API integration
    • Verify output format and accuracy

🎈 Usage

Core Features

  1. Web Scraping

    • Handle static and dynamic websites
    • Bypass CAPTCHAs and anti-bot measures
    • Support for JavaScript-heavy sites
  2. AI-Powered Extraction

    • Natural language prompts for data extraction
    • Precise parsing using Gemini AI
    • Structured JSON output
  3. Web Interface

    • User-friendly interface for non-technical users
    • Real-time extraction results
    • Error handling and status updates
  4. API Integration

    • RESTful API for programmatic access
    • Clean JSON responses
    • Easy integration into existing workflows
  5. Monitoring and Analytics

    • Event tracking with GetAnalyzr
    • Performance monitoring
    • Usage analytics

Getting Started Workflow

  1. Access the web interface at the deployed URL
  2. Enter the target website URL
  3. Provide a clear extraction prompt (e.g., "Extract all product names and prices")
  4. Click "Extract Information"
  5. Review the structured JSON output

🚀 Deployment

The project is configured for deployment on Render with the following setup:

Production Deployment

  1. Render Deployment

    • Connect your repository to Render
    • Configure environment variables in Render dashboard
    • Deploy automatically on pushes to main branch
  2. Required Environment Variables

    SBR_WEBDRIVER="your_bright_data_scraping_browser_url"
    GEMINI_API_KEY="your_google_gemini_api_key"
    FLASK_ENV="production"
  3. Service Configuration

    • Configure as a Web Service on Render
    • Set build command: pip install -r requirements.txt
    • Set start command: python main.py
  4. Monitoring and Error Tracking

    • GetAnalyzr integration for event tracking
    • Built-in error handling and logging
    • Performance monitoring capabilities

Additional Services

  • Bright Data Scraping Browser: For reliable web scraping with CAPTCHA handling
  • Google Gemini AI: For intelligent data extraction and parsing
  • GetAnalyzr: For usage analytics and monitoring

📚 API Documentation

Endpoint: /scrape-and-parse

Method: POST

Request Body (JSON):

{
  "url": "https://www.example.com",
  "parse_description": "Extract all product names and prices"
}

Response (JSON):

Success:

{
  "success": true,
  "result": {
    "products": [
      { "name": "Product A", "price": "$10" },
      { "name": "Product B", "price": "$20" }
    ]
  }
}

Error:

{
  "error": "An error occurred during scraping or parsing"
}
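A small client sketch for the endpoint above. The payload keys and response shapes follow the documentation here; the helper names are mine, and the actual HTTP call is shown only as a comment since it needs a running server and the third-party requests package.

```python
import json


def build_scrape_request(url: str, parse_description: str) -> dict:
    """Build the JSON body expected by POST /scrape-and-parse."""
    return {"url": url, "parse_description": parse_description}


def extract_result(response_body: str):
    """Return the parsed result on success; raise with the API's error otherwise."""
    data = json.loads(response_body)
    if data.get("success"):
        return data["result"]
    raise RuntimeError(data.get("error", "Unknown error from /scrape-and-parse"))


# With the server running locally:
#   import requests
#   body = build_scrape_request("https://www.example.com",
#                               "Extract all product names and prices")
#   resp = requests.post("http://localhost:5000/scrape-and-parse", json=body)
#   print(extract_result(resp.text))
```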

⛏️ Built Using

  • Flask - Core web framework serving the interface and API
  • Selenium - Browser automation for scraping
  • Bright Data Scraping Browser - Remote browser with CAPTCHA handling
  • Google Gemini AI - AI-powered data extraction and parsing
  • GetAnalyzr - Event tracking and usage analytics
  • Render - Deployment platform

✍️ Authors

  • ArjunCodess (Arjun Vijay Prakash) - Project development and maintenance

Note: This project embraces open-source values and transparency. We love open source because it keeps us accountable, fosters collaboration, and drives innovation. For collaboration opportunities or questions, please reach out through the appropriate channels.

🎉 Acknowledgements

Sponsors

  • CyberYozh for providing reliable SMS activation and proxy solutions for multi-accounting and automation
  • Thordata for powering our web scraping infrastructure with their global network of 60M+ residential proxies

Technology Partners

  • Google for providing the Gemini AI model that powers our intelligent extraction capabilities
  • Bright Data for reliable scraping browser infrastructure
  • Render for the excellent deployment platform
  • Flask Team for the robust web framework
  • Selenium for powerful browser automation capabilities
  • Open Source Community for the countless libraries and tools that make modern web development possible

WebCrawlAI - Transforming web data into structured insights

Built with ❤️ for developers and data enthusiasts