
WebCrawlAI - AI-Powered Web Scraping Platform



AI-powered web scraping platform that leverages Gemini AI to extract specific information from websites. It handles dynamic content and CAPTCHAs, and returns clean JSON output for easy integration.


πŸ† Sponsors

CyberYozh

CyberYozh - Reliable SMS activation, residential proxies, and mobile proxies for multi-accounting.

We have gathered the best solutions for multi-accounting and automation in one place.

Thordata

Thordata - Easy access to web data at scale, perfect for AI.

A global network of 60M+ residential proxies with 99.7% availability, ensuring stable and reliable web data scraping to support AI, BI, and workflows.

🎁 Free Trial Available! Start with our free trial to experience reliable proxy infrastructure.

💰 Exclusive: Use code "THOR66" for 30% off your first purchase!
🔗 Register with invitation code "0HSUJ23G"

πŸ“ Table of Contents

🧐 About

WebCrawlAI is an intelligent web scraping platform designed to help developers, researchers, and businesses extract specific information from websites with ease. The platform combines advanced web scraping capabilities with AI-powered data extraction to handle complex websites, dynamic content, and CAPTCHAs.

The platform features an AI-powered extraction engine that uses Google's Gemini AI model to precisely parse and extract requested information based on natural language prompts. Users can simply provide a URL and describe what data they need (e.g., "Extract all product names and prices") and receive clean, structured JSON output.

Built with modern web technologies, WebCrawlAI emphasizes reliability through robust error handling, retry mechanisms, and comprehensive monitoring. The platform is designed for both technical and non-technical users, providing a user-friendly web interface alongside a powerful API for integration into existing workflows.
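As a sketch of the extraction flow described above, the scraped page text and the user's natural-language request are combined into a single prompt that asks Gemini for JSON only. The helper below is hypothetical and only illustrates the idea; the project's actual prompt wording and internals may differ.

```python
def build_extraction_prompt(page_text: str, parse_description: str) -> str:
    """Combine scraped page text with the user's request into one AI prompt.

    Hypothetical helper illustrating the approach, not the project's
    actual prompt template.
    """
    return (
        "You are a data-extraction assistant. From the page content below, "
        f"{parse_description}. Respond with valid JSON only, no commentary.\n\n"
        f"Page content:\n{page_text}"
    )
```

Constraining the model to "valid JSON only" is what makes the output directly usable as structured data downstream.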

🏁 Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

  • Python (v3.8 or higher)
  • pip package manager
  • Bright Data Scraping Browser account (for SBR_WEBDRIVER)
  • Google Gemini API Key (for AI-powered extraction)

Installing

  1. Clone the repository

    git clone https://github.com/ArjunCodess/WebCrawlAI.git
    cd WebCrawlAI
  2. Install dependencies

    pip install -r requirements.txt
  3. Set up environment variables. Create a .env file with the required variables:

    SBR_WEBDRIVER="your_bright_data_scraping_browser_url"
    GEMINI_API_KEY="your_google_gemini_api_key"
  4. Run the application

    python main.py

The application will be available at http://localhost:5000 (default Flask port).
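Before starting the server, it can help to fail fast when either of the two required variables is missing. A minimal, stdlib-only sketch (the variable names come from the .env example above; the helper itself is hypothetical, not part of main.py):

```python
import os

REQUIRED_VARS = ("SBR_WEBDRIVER", "GEMINI_API_KEY")


def check_settings(env=os.environ) -> dict:
    """Return the required settings, raising early if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing environment variables: {', '.join(missing)}"
        )
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing at startup with a clear message beats a confusing error deep inside a scraping run.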

🔧 Running the tests

Currently, the project uses manual testing and user acceptance testing. Automated testing setup is planned for future releases.

Manual Testing

  1. Development Testing

    • Run the development server with python main.py
    • Test core features: web scraping, AI extraction, JSON output
    • Verify error handling and retry mechanisms
  2. Integration Testing

    • Test with various website types (static, dynamic, with CAPTCHAs)
    • Verify AI extraction accuracy with different prompts
    • Test API endpoints and response formats
  3. User Journey Testing

    • Complete web interface workflow
    • Test API integration
    • Verify output format and accuracy

🎈 Usage

Core Features

  1. Web Scraping

    • Handle static and dynamic websites
    • Bypass CAPTCHAs and anti-bot measures
    • Support for JavaScript-heavy sites
  2. AI-Powered Extraction

    • Natural language prompts for data extraction
    • Precise parsing using Gemini AI
    • Structured JSON output
  3. Web Interface

    • User-friendly interface for non-technical users
    • Real-time extraction results
    • Error handling and status updates
  4. API Integration

    • RESTful API for programmatic access
    • Clean JSON responses
    • Easy integration into existing workflows
  5. Monitoring and Analytics

    • Event tracking with GetAnalyzr
    • Performance monitoring
    • Usage analytics

Getting Started Workflow

  1. Access the web interface at the deployed URL
  2. Enter the target website URL
  3. Provide a clear extraction prompt (e.g., "Extract all product names and prices")
  4. Click "Extract Information"
  5. Review the structured JSON output

🚀 Deployment

The project is configured for deployment on Render with the following setup:

Production Deployment

  1. Render Deployment

    • Connect your repository to Render
    • Configure environment variables in Render dashboard
    • Deploy automatically on pushes to main branch
  2. Required Environment Variables

    SBR_WEBDRIVER="your_bright_data_scraping_browser_url"
    GEMINI_API_KEY="your_google_gemini_api_key"
    FLASK_ENV="production"
  3. Service Configuration

    • Configure as a Web Service on Render
    • Set build command: pip install -r requirements.txt
    • Set start command: python main.py
  4. Monitoring and Error Tracking

    • GetAnalyzr integration for event tracking
    • Built-in error handling and logging
    • Performance monitoring capabilities

Additional Services

  • Bright Data Scraping Browser: For reliable web scraping with CAPTCHA handling
  • Google Gemini AI: For intelligent data extraction and parsing
  • GetAnalyzr: For usage analytics and monitoring

📚 API Documentation

Endpoint: /scrape-and-parse

Method: POST

Request Body (JSON):

{
  "url": "https://www.example.com",
  "parse_description": "Extract all product names and prices"
}

Response (JSON):

Success:

{
  "success": true,
  "result": {
    "products": [
      { "name": "Product A", "price": "$10" },
      { "name": "Product B", "price": "$20" }
    ]
  }
}

Error:

{
  "error": "An error occurred during scraping or parsing"
}
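A small client sketch for the endpoint above. The payload keys and response shapes follow the documentation here; the helper names are mine, and the actual HTTP call is shown only as a comment since it needs a running server and the third-party requests package.

```python
import json


def build_scrape_request(url: str, parse_description: str) -> dict:
    """Build the JSON body expected by POST /scrape-and-parse."""
    return {"url": url, "parse_description": parse_description}


def extract_result(response_body: str):
    """Return the parsed result on success; raise with the API's error otherwise."""
    data = json.loads(response_body)
    if data.get("success"):
        return data["result"]
    raise RuntimeError(data.get("error", "Unknown error from /scrape-and-parse"))


# With the server running locally:
#   import requests
#   body = build_scrape_request("https://www.example.com",
#                               "Extract all product names and prices")
#   resp = requests.post("http://localhost:5000/scrape-and-parse", json=body)
#   print(extract_result(resp.text))
```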

⛏️ Built Using

  • Flask - Core web framework serving the interface and API
  • Selenium - Browser automation for scraping
  • Bright Data Scraping Browser - Remote browser with CAPTCHA handling
  • Google Gemini AI - AI-powered data extraction and parsing
  • GetAnalyzr - Event tracking and usage analytics
  • Render - Deployment platform

✍️ Authors

  • ArjunCodess (Arjun Vijay Prakash) - Project development and maintenance

Note: This project embraces open-source values and transparency. We love open source because it keeps us accountable, fosters collaboration, and drives innovation. For collaboration opportunities or questions, please reach out through the appropriate channels.

🎉 Acknowledgements

Sponsors

  • CyberYozh for providing reliable SMS activation and proxy solutions for multi-accounting and automation
  • Thordata for powering our web scraping infrastructure with their global network of 60M+ residential proxies

Technology Partners

  • Google for providing the Gemini AI model that powers our intelligent extraction capabilities
  • Bright Data for reliable scraping browser infrastructure
  • Render for the excellent deployment platform
  • Flask Team for the robust web framework
  • Selenium for powerful browser automation capabilities
  • Open Source Community for the countless libraries and tools that make modern web development possible

WebCrawlAI - Transforming web data into structured insights

Built with ❤️ for developers and data enthusiasts