
Commit 916bc76 (1 parent: cc70153)

feat: add guidance for CLAUDE bot and enhance repo root detection in fetch_repos.py

File tree: 2 files changed (+108, -0 lines)

CLAUDE.md

Lines changed: 94 additions & 0 deletions
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Development Commands

**Local Development:**

- `./start.sh [MAX_WORKERS]` - Run the complete producer-consumer pipeline locally
  - `MAX_WORKERS` defaults to 3 if not provided
  - Set the `ORG_NAME` environment variable to override the target organization
  - Uses RCC for environment isolation and task execution

**Individual Task Execution:**

- `rcc run -t "Producer" -e devdata/env-for-producer.json` - Run the producer task only
- `rcc run -t "Consumer" -e devdata/env-for-consumer.json` - Run the consumer task only
- `rcc run -t "Reporter" -e devdata/env-for-reporter.json` - Run the reporter task only

**Testing:**

- `python -m pytest tests/` - Run the full test suite
- `python -m pytest tests/test_generate_shards_and_matrix.py` - Run a specific test module

**Matrix Generation:**

- `python scripts/generate_shards_and_matrix.py <MAX_WORKERS>` - Generate shards and the GitHub Actions matrix
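The commit does not show what the generated matrix looks like; as a rough sketch (the function name and output shape here are assumptions, not the script's actual interface), a GitHub Actions matrix for N shards can be built like this:

```python
import json

def build_matrix(num_shards: int) -> str:
    # GitHub Actions consumes a JSON object with an "include" list;
    # each entry becomes one parallel job in the matrix.
    matrix = {"include": [{"shard": i} for i in range(num_shards)]}
    return json.dumps(matrix)

print(build_matrix(3))
# {"include": [{"shard": 0}, {"shard": 1}, {"shard": 2}]}
```

The JSON string would typically be written to `$GITHUB_OUTPUT` so a downstream job can consume it via `fromJson()`.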
## Architecture Overview

This is a **Robocorp RPA bot** implementing a producer-consumer pattern with matrix sharding for GitHub repository processing at scale. The architecture follows a three-stage pipeline:

### Core Components

**1. Producer (`tasks.py:producer()`):**

- Fetches repository metadata from GitHub organizations using `scripts/fetch_repos.py`
- Creates a work item for each repository
- Outputs to `output/producer-to-consumer/work-items.json`

**2. Consumer (`tasks.py:consumer()`):**

- Processes repository work items by cloning repos
- Uses GitPython for shallow clones with error handling
- Creates ZIP archives of cloned repositories
- Supports sharding for parallel execution
- Outputs processing reports and ZIP files
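The archive-and-cleanup step can be sketched as follows. This is a minimal illustration, not the repo's actual `tasks.py` code; in the real consumer, GitPython's `Repo.clone_from(url, repo_dir, depth=1)` would produce `repo_dir` before this step runs.

```python
import shutil
from pathlib import Path

def archive_repo(repo_dir: Path) -> Path:
    """Zip a cloned repository and remove the working copy."""
    # shutil.make_archive appends ".zip" to the base name it is given,
    # so "output/repo" becomes "output/repo.zip".
    zip_path = shutil.make_archive(str(repo_dir), "zip", root_dir=repo_dir)
    # Clean up the clone after ZIP creation, as the notes below describe.
    shutil.rmtree(repo_dir)
    return Path(zip_path)
```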
**3. Reporter (`tasks.py:reporter()`):**

- Aggregates results from all consumer shards
- Generates comprehensive processing statistics
- Creates final reports with success rates and error details
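As an illustration of the aggregation step (the field and status names here are assumptions, not the actual `reporter()` implementation):

```python
def summarize(results: list[dict]) -> dict:
    """Aggregate per-repository results into overall statistics."""
    total = len(results)
    done = sum(1 for r in results if r.get("status") == "DONE")
    return {
        "total": total,
        "succeeded": done,
        "failed": total - done,
        # Success rate as a percentage; guard against an empty run.
        "success_rate": round(100 * done / total, 1) if total else 0.0,
    }
```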
### Key Architectural Patterns

**Matrix Sharding:**

- `scripts/generate_shards_and_matrix.py` splits work items across parallel workers
- Each shard is processed by an independent consumer instance
- Enables massive parallelism in GitHub Actions workflows

**Environment Isolation:**

- All tasks run in RCC-managed environments defined by `robot.yaml` and `conda.yaml`
- The environment is rebuilt automatically only when dependencies change
- Supports both local development and CI/CD execution

**Work Item Flow:**

```
Producer → work-items.json → Sharding → [work-items-shard-0.json, work-items-shard-1.json, ...] → Consumer instances → Reporter
```
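The flow above hinges on splitting `work-items.json` into per-shard files. A minimal sketch of that split (round-robin distribution is an assumption; the actual script may chunk differently):

```python
def split_into_shards(items: list, num_shards: int) -> list[list]:
    """Distribute work items across shards, round-robin."""
    # Never create more shards than there are items (and at least one),
    # mirroring the "automatically adjusts shard count" behavior noted below.
    num_shards = max(1, min(num_shards, len(items) or 1))
    shards = [[] for _ in range(num_shards)]
    for i, item in enumerate(items):
        shards[i % num_shards].append(item)
    return shards
```

Each inner list would then be written out as `work-items-shard-<i>.json` for one consumer instance.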

**Error Handling:**

- Network errors result in a "released" status so work items can be retried
- Git errors are categorized and handled appropriately
- Comprehensive error logging and status tracking
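A sketch of how such categorization might look (the matching strings and status names are illustrative assumptions, not the repo's actual logic):

```python
# Substrings that suggest a transient network failure (illustrative).
TRANSIENT_MARKERS = ("could not resolve host", "timed out", "connection reset")

def classify_git_error(message: str) -> str:
    """Map a git error message to a work-item status."""
    if any(marker in message.lower() for marker in TRANSIENT_MARKERS):
        return "released"  # transient network failure: retry later
    return "failed"        # permanent failure: bad URL, auth, missing repo
```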

## Project Structure Notes

- `robot.yaml` - Robocorp task definitions and RCC configuration
- `conda.yaml` - Python environment specification managed by RCC
- `pyproject.toml` - Robocorp logging configuration only
- `devdata/` - Environment configurations for each task (producer, consumer, reporter)
- `scripts/` - Utility scripts for repo fetching, sharding, and GitHub Actions integration
- `output/` - Task outputs, ZIP archives, and processing reports
- `.github/workflows/` - GitHub Actions workflows for matrix-based parallel execution

## Development Notes

- **RCC Dependency:** All task execution requires RCC, Robocorp's command-line environment manager
- **Sharding Logic:** The system automatically adjusts the shard count to the number of available work items
- **Repository Cleanup:** Consumer tasks automatically clean up cloned repositories after ZIP creation
- **Idempotency:** Tasks are designed to be rerunnable without side effects
- **GitHub Rate Limits:** Uses authenticated GitHub API calls (configure with appropriate tokens)
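On the rate-limit point: authenticated GitHub REST requests get a much higher quota (5,000/hour versus 60/hour unauthenticated), and authentication is just a token header. A minimal sketch (the `GITHUB_TOKEN` variable name is an assumption about this repo's configuration):

```python
import os

def github_headers() -> dict:
    """Build headers for authenticated GitHub REST API calls."""
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        # With a token, requests count against the 5,000/hour limit
        # instead of the 60/hour unauthenticated limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers
```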

## GitHub Actions Integration

The project includes workflows for:

- Building custom runner Docker images with pre-installed dependencies
- Matrix-based parallel execution across multiple GitHub Actions runners
- Both hosted and self-hosted runner support
- Automatic Docker image rebuilds when environment files change

scripts/fetch_repos.py

Lines changed: 14 additions & 0 deletions
```python
# Context from earlier in the file; the imports below are shown for
# completeness and are presumably already present at the top of fetch_repos.py.
from pathlib import Path

from pandas import DataFrame

# Timeout for HTTP requests (in seconds)
REQUEST_TIMEOUT = 10

def get_repo_root() -> Path:
    """
    Find the root directory of the repository by looking for robot.yaml.

    Returns:
        Path: The root directory of the repository
    """
    current = Path(__file__).resolve()
    for parent in [current] + list(current.parents):
        if (parent / 'robot.yaml').exists():
            return parent
    # Fallback: assume the repo root is one level above this script's directory
    return Path(__file__).parent.parent

def fetch_github_repos(entity: str, entity_type: str = None, write_csv: bool = False) -> DataFrame:
    """
    Fetch public repositories from a GitHub organization or user.
    ...
    """
```
