| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Development Commands |
| 6 | + |
| 7 | +**Local Development:** |
| 8 | +- `./start.sh [MAX_WORKERS]` - Run the complete producer-consumer pipeline locally |
| 9 | + - MAX_WORKERS defaults to 3 if not provided |
| 10 | + - Set `ORG_NAME` environment variable to override target organization |
| 11 | + - Uses RCC for environment isolation and task execution |
| 12 | + |
| 13 | +**Individual Task Execution:** |
| 14 | +- `rcc run -t "Producer" -e devdata/env-for-producer.json` - Run producer task only |
| 15 | +- `rcc run -t "Consumer" -e devdata/env-for-consumer.json` - Run consumer task only |
| 16 | +- `rcc run -t "Reporter" -e devdata/env-for-reporter.json` - Run reporter task only |
| 17 | + |
| 18 | +**Testing:** |
| 19 | +- `python -m pytest tests/` - Run test suite |
| 20 | +- `python -m pytest tests/test_generate_shards_and_matrix.py` - Run specific test |
| 21 | + |
| 22 | +**Matrix Generation:** |
| 23 | +- `python scripts/generate_shards_and_matrix.py <MAX_WORKERS>` - Generate shards and GitHub Actions matrix |
| 24 | + |
| 25 | +## Architecture Overview |
| 26 | + |
| 27 | +This is a **Robocorp RPA bot** implementing a producer-consumer pattern with matrix sharding for GitHub repository processing at scale. The architecture follows a three-stage pipeline: |
| 28 | + |
| 29 | +### Core Components |
| 30 | + |
| 31 | +**1. Producer (`tasks.py:producer()`):** |
| 32 | +- Fetches repository metadata from GitHub organizations using `scripts/fetch_repos.py` |
| 33 | +- Creates work items for each repository |
| 34 | +- Outputs to `output/producer-to-consumer/work-items.json` |
| 35 | + |
| 36 | +**2. Consumer (`tasks.py:consumer()`):** |
| 37 | +- Processes repository work items by cloning repos |
| 38 | +- Uses GitPython for shallow clones with error handling |
| 39 | +- Creates ZIP archives of cloned repositories |
| 40 | +- Supports sharding for parallel execution |
| 41 | +- Outputs processing reports and ZIP files |
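The archive step at the end of the consumer can be sketched with the standard library alone. This is a minimal sketch rather than the code in `tasks.py`: the function name `archive_and_cleanup` and both directory arguments are hypothetical, and the clone step itself is left out.

```python
import shutil
from pathlib import Path


def archive_and_cleanup(repo_dir, zip_dir):
    """Zip a cloned repository, then delete the working copy.

    Hypothetical helper: the real consumer may name and place archives differently.
    """
    repo_dir = Path(repo_dir)
    zip_dir = Path(zip_dir)
    zip_dir.mkdir(parents=True, exist_ok=True)
    # make_archive appends ".zip" to the base name and returns the archive path.
    archive = shutil.make_archive(
        str(zip_dir / repo_dir.name), "zip", root_dir=str(repo_dir)
    )
    # Remove the clone only after the archive has been written successfully.
    shutil.rmtree(repo_dir)
    return Path(archive)
```

Deleting the clone after `make_archive` returns means a failed archive leaves the working copy in place for inspection.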
| 42 | + |
| 43 | +**3. Reporter (`tasks.py:reporter()`):** |
| 44 | +- Aggregates results from all consumer shards |
| 45 | +- Generates comprehensive processing statistics |
| 46 | +- Creates final reports with success rates and error details |
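The aggregation the reporter performs might look roughly like the following. The report shape (a list of result dicts per shard, each with a `status` key) and the `"done"` status value are assumptions made for illustration; the actual report format is not described here.

```python
def summarize(shard_reports):
    """Merge per-shard result lists into one summary dict.

    Assumes each result is a dict with a "status" key, where "done" means success.
    """
    results = [result for report in shard_reports for result in report]
    total = len(results)
    succeeded = sum(1 for result in results if result.get("status") == "done")
    errors = [result for result in results if result.get("status") != "done"]
    # Guard against division by zero when no shard produced any results.
    success_rate = round(100 * succeeded / total, 1) if total else 0.0
    return {
        "total": total,
        "succeeded": succeeded,
        "success_rate": success_rate,
        "errors": errors,
    }
```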
| 47 | + |
| 48 | +### Key Architectural Patterns |
| 49 | + |
| 50 | +**Matrix Sharding:** |
| 51 | +- `scripts/generate_shards_and_matrix.py` splits work items across parallel workers |
| 52 | +- Each shard gets processed by independent consumer instances |
| 53 | +- Lets each shard run as a separate matrix job in GitHub Actions workflows |
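One way to picture the sharding step is a round-robin split followed by a matrix with one entry per shard. This is a sketch under assumptions, not the contents of `scripts/generate_shards_and_matrix.py`: the function names and the `shard_id` matrix key are hypothetical, and the real script may distribute items differently.

```python
def split_into_shards(work_items, max_workers):
    """Distribute work items round-robin across at most max_workers shards."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    # Cap the shard count at the number of items so no shard is empty.
    shard_count = min(max_workers, len(work_items))
    shards = [[] for _ in range(shard_count)]
    for index, item in enumerate(work_items):
        shards[index % shard_count].append(item)
    return shards


def build_matrix(shards):
    """Build a GitHub Actions style matrix with one entry per shard.

    The "shard_id" key is an assumed name for illustration.
    """
    return {"include": [{"shard_id": shard_id} for shard_id in range(len(shards))]}
```

Capping the shard count at the number of items is one simple way to avoid empty shards when there are fewer work items than workers.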
| 54 | + |
| 55 | +**Environment Isolation:** |
| 56 | +- All tasks run in RCC-managed environments defined by `robot.yaml` and `conda.yaml` |
| 57 | +- Environment is automatically rebuilt only when dependencies change |
| 58 | +- Supports local development and CI/CD execution |
| 59 | + |
| 60 | +**Work Item Flow:** |
| 61 | +``` |
| 62 | +Producer → work-items.json → Sharding → [work-items-shard-0.json, work-items-shard-1.json, ...] → Consumer instances → Reporter |
| 63 | +``` |
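The shard files in the flow above could be written as follows. Only the `work-items-shard-N.json` naming comes from the flow; the helper name `write_shard_files`, the output directory, and the JSON layout of each file are assumptions.

```python
import json
from pathlib import Path


def write_shard_files(shards, output_dir):
    """Write each shard to work-items-shard-<N>.json and return the paths.

    Hypothetical helper: the directory and per-file JSON layout are assumed.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for shard_id, items in enumerate(shards):
        path = output_dir / f"work-items-shard-{shard_id}.json"
        path.write_text(json.dumps(items, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```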
| 64 | + |
| 65 | +**Error Handling:** |
| 66 | +- Network errors result in "released" status for retry |
| 67 | +- Git errors are categorized and handled appropriately |
| 68 | +- Comprehensive error logging and status tracking |
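The split between retryable network errors and other Git errors could be sketched like this. Only the "released" status for network errors comes from the notes above; the marker strings, the category names, and the `"failed"` status are illustrative assumptions, not the consumer's actual rules.

```python
# Assumed substrings that indicate a transient network problem.
NETWORK_MARKERS = (
    "could not resolve host",
    "connection timed out",
    "connection reset",
)

# Assumed mapping from Git error text to a category label.
GIT_CATEGORIES = {
    "repository not found": "not_found",
    "authentication failed": "auth",
    "permission denied": "auth",
}


def classify_clone_error(message):
    """Map an error message to a work item status and an error category."""
    text = message.lower()
    if any(marker in text for marker in NETWORK_MARKERS):
        # Network errors release the work item so it can be retried.
        return {"status": "released", "category": "network"}
    for marker, category in GIT_CATEGORIES.items():
        if marker in text:
            return {"status": "failed", "category": category}
    return {"status": "failed", "category": "unknown"}
```

Checking network markers first keeps a transient failure from being recorded as a permanent one.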
| 69 | + |
| 70 | +## Project Structure Notes |
| 71 | + |
| 72 | +- `robot.yaml` - Robocorp task definitions and RCC configuration |
| 73 | +- `conda.yaml` - Python environment specification managed by RCC |
| 74 | +- `pyproject.toml` - Robocorp logging configuration only |
| 75 | +- `devdata/` - Environment configurations for each task (producer, consumer, reporter) |
| 76 | +- `scripts/` - Utility scripts for repo fetching, sharding, and GitHub Actions integration |
| 77 | +- `output/` - Task outputs, ZIP archives, and processing reports |
| 78 | +- `.github/workflows/` - GitHub Actions workflows for matrix-based parallel execution |
| 79 | + |
| 80 | +## Development Notes |
| 81 | + |
| 82 | +- **RCC Dependency:** All task execution requires RCC, Robocorp's command-line tool, for environment management |
| 83 | +- **Sharding Logic:** The system automatically adjusts shard count based on available work items |
| 84 | +- **Repository Cleanup:** Consumer tasks automatically clean up cloned repositories after ZIP creation |
| 85 | +- **Idempotency:** Tasks are designed to be rerunnable without unintended side effects |
| 86 | +- **GitHub Rate Limits:** Uses authenticated GitHub API calls, which get higher rate limits than anonymous calls (configure with appropriate tokens) |
| 87 | + |
| 88 | +## GitHub Actions Integration |
| 89 | + |
| 90 | +The project includes workflows for: |
| 91 | +- Building custom runner Docker images with pre-installed dependencies |
| 92 | +- Matrix-based parallel execution across multiple GitHub Actions runners |
| 93 | +- Support for both GitHub-hosted and self-hosted runners |
| 94 | +- Automatic Docker image rebuilds when environment files change |