nlweb-ai/crawler

Schema.org Crawler

A distributed web crawler for NLWeb, designed to fetch and process schema.org structured data from websites at scale.

🚀 Quick Start

Deploy to Azure Kubernetes (Production)

Option 1: Complete Setup from Scratch

# This interactive script will:
# 1. Ask for resource group name and region
# 2. Create ALL Azure resources (AKS, Database, Queue, Storage, etc.)
# 3. Build and deploy everything
# 4. Give you the public URL (~15-20 minutes)
./azure/setup-and-deploy.sh

Option 2: Deploy to Existing Resources

If you already have Azure resources (database, queue, etc.):

# Step 1: Configure your environment
cp .env.example .env
# Edit .env with your Azure credentials

# Step 2: Create Kubernetes secrets
./azure/create-secrets-from-env.sh

# Step 3: Deploy to AKS
./azure/deploy-to-aks.sh

Access Your Deployment

After deployment completes:

# Get the public URL
kubectl get service crawler-master-external -n crawler

# Access the crawler
# Web UI: http://<EXTERNAL-IP>/
# API: http://<EXTERNAL-IP>/api/status
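
The IP lookup can also be scripted. A minimal sketch, assuming the service is of type LoadBalancer and uses the standard Kubernetes status fields (this block needs a live cluster to run):

```shell
# Capture the external IP (assumes a LoadBalancer service with an assigned IP)
EXTERNAL_IP=$(kubectl get service crawler-master-external -n crawler \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Confirm the master service is reachable
curl -s "http://${EXTERNAL_IP}/api/status"
```

Note that the IP may take a minute or two to be assigned after deployment; an empty result from the jsonpath query means the load balancer is still provisioning.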

Create Stable URL (Optional)

# Create a static IP for stable URL
./azure/create-static-ip.sh

See Azure Deployment Guide for detailed instructions.

Local Development

# Start master service (API + Scheduler)
./start_master.sh

# Start worker service (in another terminal)
./start_worker.sh

πŸ“ Project Structure

crawler/
├── azure/              # Azure deployment scripts
├── code/               # Source code
│   ├── core/          # Core crawler logic
│   └── tests/         # Unit tests
├── k8s/               # Kubernetes manifests
├── testing/           # Testing and monitoring scripts
├── data/              # Test data (git-ignored)
└── start_*.sh         # Production starter scripts

πŸ—οΈ Architecture

The crawler consists of:

  • Master Service: REST API and job scheduler
  • Worker Service(s): Process crawling jobs from queue
  • Azure Service Bus: Job queue
  • Azure SQL Database: Metadata and state
  • Azure Blob Storage: Raw data storage
  • Azure AI Search: Vector database for embeddings

🔧 Configuration

Create a .env file with your Azure credentials:

cp .env.example .env
# Edit .env with your Azure resources

Required environment variables:

  • AZURE_SERVICEBUS_NAMESPACE - Service Bus namespace
  • DB_SERVER, DB_DATABASE, DB_USERNAME, DB_PASSWORD - SQL Database
  • BLOB_STORAGE_ACCOUNT_NAME - Storage account
  • AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_KEY - AI Search (optional)
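
A minimal .env might look like the following; every value below is a placeholder, and .env.example remains the authoritative list of variables:

```shell
AZURE_SERVICEBUS_NAMESPACE="my-crawler-bus"
DB_SERVER="my-crawler-sql.database.windows.net"
DB_DATABASE="crawler"
DB_USERNAME="crawler_admin"
DB_PASSWORD="<your-password>"
BLOB_STORAGE_ACCOUNT_NAME="mycrawlerstorage"
# Optional: only needed if AI Search embeddings are enabled
AZURE_SEARCH_ENDPOINT="https://my-search.search.windows.net"
AZURE_SEARCH_KEY="<your-search-key>"
```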

📊 API Endpoints

  • GET / - Web UI
  • GET /api/status - System status
  • POST /api/sites - Add site to crawl
  • GET /api/queue/status - Queue statistics
  • POST /api/process/{site_url} - Trigger manual processing
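
As a sketch, the endpoints can be exercised with curl; replace <EXTERNAL-IP> with your service IP, and note that the JSON body for /api/sites is an assumed shape, not a documented contract:

```shell
# System status
curl "http://<EXTERNAL-IP>/api/status"

# Add a site to crawl (request body shape is an assumption)
curl -X POST "http://<EXTERNAL-IP>/api/sites" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# Queue statistics
curl "http://<EXTERNAL-IP>/api/queue/status"
```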

🧪 Testing

Run the complete test suite:

./testing/run_k8s_test.sh

See Testing Guide for more testing options.

🚢 Kubernetes Management

Common Commands

# View all resources
kubectl get all -n crawler

# Check pod status
kubectl get pods -n crawler

# View logs
kubectl logs -n crawler -l app=crawler-master -f    # Master logs
kubectl logs -n crawler -l app=crawler-worker -f    # Worker logs

# Scale workers
kubectl scale deployment crawler-worker -n crawler --replicas=10

# Restart deployments (after config changes)
kubectl rollout restart deployment/crawler-master -n crawler
kubectl rollout restart deployment/crawler-worker -n crawler

# Access pod shell for debugging
kubectl exec -it <pod-name> -n crawler -- /bin/bash

# Delete and redeploy
kubectl delete namespace crawler
./azure/deploy-to-aks.sh

Cost Management

# Stop AKS cluster (saves ~$60-200/month)
az aks stop --name <cluster-name> --resource-group <rg>

# Start again when needed
az aks start --name <cluster-name> --resource-group <rg>

Docker Compose (Development)

docker-compose up --build

📈 Monitoring

  • View logs: kubectl logs -n crawler -l app=crawler-master -f
  • Check status: curl http://<ip>/api/status
  • Monitor queue: python3 testing/monitoring/monitor_queue.py
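
For quick ad-hoc monitoring without the Python script, the status endpoint can be polled in a loop; a minimal sketch (replace <EXTERNAL-IP> with your service IP):

```shell
# Poll crawler status every 30 seconds (Ctrl+C to stop)
while true; do
  curl -s "http://<EXTERNAL-IP>/api/status"
  echo
  sleep 30
done
```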

πŸ“ Documentation

📄 License

MIT License

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests
  5. Submit a pull request

🆘 Support

For issues or questions, please open a GitHub issue.
