nlweb-ai/crawler

Schema.org Crawler

A distributed web crawler for NLWeb, designed to fetch and process schema.org structured data from websites at scale.

🚀 Quick Start

Deploy to Azure Kubernetes (Production)

Option 1: Complete Setup from Scratch

# This interactive script will:
# 1. Ask for resource group name and region
# 2. Create ALL Azure resources (AKS, Database, Queue, Storage, etc.)
# 3. Build and deploy everything
# 4. Give you the public URL (~15-20 minutes)
./azure/setup-and-deploy.sh

Option 2: Deploy to Existing Resources

If you already have Azure resources (database, queue, etc.):

# Step 1: Configure your environment
cp .env.example .env
# Edit .env with your Azure credentials

# Step 2: Create Kubernetes secrets
./azure/create-secrets-from-env.sh

# Step 3: Deploy to AKS
./azure/deploy-to-aks.sh

Access Your Deployment

After deployment completes:

# Get the public URL
kubectl get service crawler-master-external -n crawler

# Access the crawler
# Web UI: http://<EXTERNAL-IP>/
# API: http://<EXTERNAL-IP>/api/status
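
The IP lookup can also be scripted. A minimal sketch, assuming the service is of type LoadBalancer and uses the standard Kubernetes status fields (this block needs a live cluster to run):

```shell
# Capture the external IP (assumes a LoadBalancer service with an assigned IP)
EXTERNAL_IP=$(kubectl get service crawler-master-external -n crawler \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Confirm the master service is reachable
curl -s "http://${EXTERNAL_IP}/api/status"
```

Note that the IP may take a minute or two to be assigned after deployment; an empty result from the jsonpath query means the load balancer is still provisioning.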

Create Stable URL (Optional)

# Create a static IP for stable URL
./azure/create-static-ip.sh

See Azure Deployment Guide for detailed instructions.

Local Development

# Start master service (API + Scheduler)
./start_master.sh

# Start worker service (in another terminal)
./start_worker.sh

πŸ“ Project Structure

crawler/
├── azure/              # Azure deployment scripts
├── code/               # Source code
│   ├── core/          # Core crawler logic
│   └── tests/         # Unit tests
├── k8s/               # Kubernetes manifests
├── testing/           # Testing and monitoring scripts
├── data/              # Test data (git-ignored)
└── start_*.sh         # Production starter scripts

πŸ—οΈ Architecture

The crawler consists of:

  • Master Service: REST API and job scheduler
  • Worker Service(s): Process crawling jobs from queue
  • Azure Service Bus: Job queue
  • Azure SQL Database: Metadata and state
  • Azure Blob Storage: Raw data storage
  • Azure AI Search: Vector database for embeddings

🔧 Configuration

Create a .env file with your Azure credentials:

cp .env.example .env
# Edit .env with your Azure resources

Required environment variables:

  • AZURE_SERVICEBUS_NAMESPACE - Service Bus namespace
  • DB_SERVER, DB_DATABASE, DB_USERNAME, DB_PASSWORD - SQL Database
  • BLOB_STORAGE_ACCOUNT_NAME - Storage account
  • AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_KEY - AI Search (optional)
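
A minimal .env might look like the following; every value below is a placeholder, and .env.example remains the authoritative list of variables:

```shell
AZURE_SERVICEBUS_NAMESPACE="my-crawler-bus"
DB_SERVER="my-crawler-sql.database.windows.net"
DB_DATABASE="crawler"
DB_USERNAME="crawler_admin"
DB_PASSWORD="<your-password>"
BLOB_STORAGE_ACCOUNT_NAME="mycrawlerstorage"
# Optional: only needed if AI Search embeddings are enabled
AZURE_SEARCH_ENDPOINT="https://my-search.search.windows.net"
AZURE_SEARCH_KEY="<your-search-key>"
```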

📊 API Endpoints

  • GET / - Web UI
  • GET /api/status - System status
  • POST /api/sites - Add site to crawl
  • GET /api/queue/status - Queue statistics
  • POST /api/process/{site_url} - Trigger manual processing
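
As a sketch, the endpoints can be exercised with curl; replace <EXTERNAL-IP> with your service IP, and note that the JSON body for /api/sites is an assumed shape, not a documented contract:

```shell
# System status
curl "http://<EXTERNAL-IP>/api/status"

# Add a site to crawl (request body shape is an assumption)
curl -X POST "http://<EXTERNAL-IP>/api/sites" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# Queue statistics
curl "http://<EXTERNAL-IP>/api/queue/status"
```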

🧪 Testing

Run the complete test suite:

./testing/run_k8s_test.sh

See Testing Guide for more testing options.

🚢 Kubernetes Management

Common Commands

# View all resources
kubectl get all -n crawler

# Check pod status
kubectl get pods -n crawler

# View logs
kubectl logs -n crawler -l app=crawler-master -f    # Master logs
kubectl logs -n crawler -l app=crawler-worker -f    # Worker logs

# Scale workers
kubectl scale deployment crawler-worker -n crawler --replicas=10

# Restart deployments (after config changes)
kubectl rollout restart deployment/crawler-master -n crawler
kubectl rollout restart deployment/crawler-worker -n crawler

# Access pod shell for debugging
kubectl exec -it <pod-name> -n crawler -- /bin/bash

# Delete and redeploy
kubectl delete namespace crawler
./azure/deploy-to-aks.sh

Cost Management

# Stop AKS cluster (saves ~$60-200/month)
az aks stop --name <cluster-name> --resource-group <rg>

# Start again when needed
az aks start --name <cluster-name> --resource-group <rg>

Docker Compose (Development)

docker-compose up --build

📈 Monitoring

  • View logs: kubectl logs -n crawler -l app=crawler-master -f
  • Check status: curl http://<ip>/api/status
  • Monitor queue: python3 testing/monitoring/monitor_queue.py
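
For quick ad-hoc monitoring without the Python script, the status endpoint can be polled in a loop; a minimal sketch (replace <EXTERNAL-IP> with your service IP):

```shell
# Poll crawler status every 30 seconds (Ctrl+C to stop)
while true; do
  curl -s "http://<EXTERNAL-IP>/api/status"
  echo
  sleep 30
done
```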

πŸ“ Documentation

📄 License

MIT License

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests
  5. Submit a pull request

🆘 Support

For issues or questions, please open a GitHub issue.
