A distributed web crawler designed to fetch and process schema.org structured data from websites at scale.
# This interactive script will:
# 1. Ask for resource group name and region
# 2. Create ALL Azure resources (AKS, Database, Queue, Storage, etc.)
# 3. Build and deploy everything
# 4. Give you the public URL (~15-20 minutes)
./azure/setup-and-deploy.sh

If you already have Azure resources (database, queue, etc.):
# Step 1: Configure your environment
cp .env.example .env
# Edit .env with your Azure credentials
# Step 2: Create Kubernetes secrets
./azure/create-secrets-from-env.sh
# Step 3: Deploy to AKS
./azure/deploy-to-aks.sh

After deployment completes:
# Get the public URL
kubectl get service crawler-master-external -n crawler
# Access the crawler
# Web UI: http://<EXTERNAL-IP>/
# API: http://<EXTERNAL-IP>/api/status

# Create a static IP for stable URL
./azure/create-static-ip.sh

See Azure Deployment Guide for detailed instructions.
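Once the external IP is up, one way to sanity-check the deployment from a script is to hit the status endpoint. A minimal sketch, assuming only what is listed above (the IP placeholder and the `/api/status` route); the response shape depends on the API:

```python
import requests

BASE_URL = "http://<EXTERNAL-IP>"  # substitute the EXTERNAL-IP reported by kubectl above

# The master service exposes system status at /api/status
resp = requests.get(f"{BASE_URL}/api/status", timeout=10)
resp.raise_for_status()
print(resp.json())
```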
# Start master service (API + Scheduler)
./start_master.sh
# Start worker service (in another terminal)
./start_worker.sh

crawler/
├── azure/        # Azure deployment scripts
├── code/         # Source code
│   ├── core/     # Core crawler logic
│   └── tests/    # Unit tests
├── k8s/          # Kubernetes manifests
├── testing/      # Testing and monitoring scripts
├── data/         # Test data (git-ignored)
└── start_*.sh    # Production starter scripts
The crawler consists of:
- Master Service: REST API and job scheduler
- Worker Service(s): Process crawling jobs from queue
- Azure Service Bus: Job queue
- Azure SQL Database: Metadata and state
- Azure Blob Storage: Raw data storage
- Azure AI Search: Vector database for embeddings
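To make the flow concrete, here is a minimal sketch of how a worker might pull crawl jobs from the Service Bus queue. The queue name ("crawl-jobs") and the JSON message shape are illustrative assumptions rather than the project's actual code; `AZURE_SERVICEBUS_NAMESPACE` is the environment variable documented in the configuration section below.

```python
import json
import os

from azure.identity import DefaultAzureCredential   # pip install azure-identity
from azure.servicebus import ServiceBusClient       # pip install azure-servicebus

namespace = os.environ["AZURE_SERVICEBUS_NAMESPACE"]  # assumes the bare namespace name, without the domain suffix
queue_name = "crawl-jobs"                             # hypothetical queue name for illustration

credential = DefaultAzureCredential()
with ServiceBusClient(f"{namespace}.servicebus.windows.net", credential) as client:
    with client.get_queue_receiver(queue_name=queue_name) as receiver:
        for message in receiver:                      # blocks, yielding jobs as they arrive
            job = json.loads(str(message))            # assumed JSON body, e.g. {"site_url": "https://example.com"}
            print("would crawl:", job.get("site_url"))
            receiver.complete_message(message)        # settle the message so it leaves the queue
```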
Create a .env file with your Azure credentials:
cp .env.example .env
# Edit .env with your Azure resources

Required environment variables:
- AZURE_SERVICEBUS_NAMESPACE - Service Bus namespace
- DB_SERVER, DB_DATABASE, DB_USERNAME, DB_PASSWORD - SQL Database
- BLOB_STORAGE_ACCOUNT_NAME - Storage account
- AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_KEY - AI Search (optional)
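As a quick sanity check before starting a service, you can verify these variables are set. This is a small standalone sketch (not part of the repository) using the variable names listed above:

```python
import os

REQUIRED = [
    "AZURE_SERVICEBUS_NAMESPACE",
    "DB_SERVER", "DB_DATABASE", "DB_USERNAME", "DB_PASSWORD",
    "BLOB_STORAGE_ACCOUNT_NAME",
]
OPTIONAL = ["AZURE_SEARCH_ENDPOINT", "AZURE_SEARCH_KEY"]  # AI Search is optional

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {', '.join(missing)}")

for name in OPTIONAL:
    if not os.environ.get(name):
        print(f"Note: {name} is not set; AI Search features will be unavailable.")
```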
API endpoints:

- GET / - Web UI
- GET /api/status - System status
- POST /api/sites - Add site to crawl
- GET /api/queue/status - Queue statistics
- POST /api/process/{site_url} - Trigger manual processing
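For example, adding a site and triggering processing from a script might look like the sketch below. The `"url"` payload field and the URL-encoding of the path segment are assumptions; check the API Documentation for the exact shapes.

```python
import requests
from urllib.parse import quote

BASE_URL = "http://<EXTERNAL-IP>"   # or the local address when running ./start_master.sh

# Add a site to crawl (the "url" field name is an assumption)
resp = requests.post(f"{BASE_URL}/api/sites",
                     json={"url": "https://example.com"}, timeout=10)
print(resp.status_code, resp.text)

# Trigger manual processing for that site; URL-encoding the site in the path
# is an assumption about how the endpoint expects it
encoded = quote("https://example.com", safe="")
print(requests.post(f"{BASE_URL}/api/process/{encoded}", timeout=10).status_code)
```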
Run the complete test suite:
./testing/run_k8s_test.sh

See Testing Guide for more testing options.
# View all resources
kubectl get all -n crawler
# Check pod status
kubectl get pods -n crawler
# View logs
kubectl logs -n crawler -l app=crawler-master -f # Master logs
kubectl logs -n crawler -l app=crawler-worker -f # Worker logs
# Scale workers
kubectl scale deployment crawler-worker -n crawler --replicas=10
# Restart deployments (after config changes)
kubectl rollout restart deployment/crawler-master -n crawler
kubectl rollout restart deployment/crawler-worker -n crawler
# Access pod shell for debugging
kubectl exec -it <pod-name> -n crawler -- /bin/bash
# Delete and redeploy
kubectl delete namespace crawler
./azure/deploy-to-aks.sh

# Stop AKS cluster (saves ~$60-200/month)
az aks stop --name <cluster-name> --resource-group <rg>
# Start again when needed
az aks start --name <cluster-name> --resource-group <rg>

To run the stack locally with Docker Compose:

docker-compose up --build

To monitor the system:

- View logs: kubectl logs -n crawler -l app=crawler-master -f
- Check status: curl http://<ip>/api/status
- Monitor queue: python3 testing/monitoring/monitor_queue.py
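If you'd rather not run the bundled monitor script, a rough equivalent is to poll the queue status endpoint yourself. This sketch just prints whatever the API returns; the response fields depend on the master service:

```python
import time
import requests

BASE_URL = "http://<ip>"   # the crawler master's address, as above

while True:
    stats = requests.get(f"{BASE_URL}/api/queue/status", timeout=10).json()
    print(stats)            # queue statistics as returned by the master service
    time.sleep(30)          # poll every 30 seconds
```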
Documentation:

- Azure Deployment Guide
- Kubernetes Deployment
- Local Testing Guide
- API Documentation
- Testing Documentation
To contribute:

- Fork the repository
- Create a feature branch
- Make your changes
- Run tests
- Submit a pull request
For issues or questions:
- Check the documentation
- Review testing scripts
- Open an issue on GitHub