GitHub AGENTS.md and CLAUDE.md Scraper

Discover and download AGENTS.md and CLAUDE.md files from GitHub repositories.

Note

For learning and inspiration. Downloaded files retain their original licenses—respect those terms.

What It Does

  1. get_repos.py — Find repos via GitHub Search API
  2. get_agentsmd.py — Download their AGENTS.md and CLAUDE.md files

Searches recent, non-archived GitHub repos sorted by stars (default: 50,000 repos max). Default language: Python. Configurable via config.yaml.

Installation

git clone https://github.com/yourusername/github-get-agents.git
cd github-get-agents
pip3 install uv && uv sync

Configuration

All settings are centralized in config.yaml. Edit this file to customize:

  • Repository search: Language, date ranges, star bins, max repos
  • API settings: Timeouts, retries, backoff strategies
  • Download settings: Delays, output directories

Default values work well for most use cases. CLI arguments override config values when specified.
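The override rule can be sketched roughly as follows. This is an illustration only, not the scripts' actual code: the `max_repos` and `language` keys, the `--language` flag, and the `merge_settings` helper are assumptions, and a plain dict stands in for the parsed config.yaml.

```python
import argparse

# Hypothetical defaults standing in for values loaded from config.yaml.
CONFIG_DEFAULTS = {"max_repos": 50000, "language": "Python"}


def merge_settings(config, argv):
    """Return config values, overridden by any CLI arguments that were given."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "not specified" apart from a real value.
    parser.add_argument("-n", dest="max_repos", type=int, default=None)
    parser.add_argument("--language", default=None)  # assumed flag
    args = parser.parse_args(argv)

    merged = dict(config)
    for key, value in vars(args).items():
        if value is not None:
            merged[key] = value
    return merged


print(merge_settings(CONFIG_DEFAULTS, ["-n", "1000"]))
# → {'max_repos': 1000, 'language': 'Python'}
```

Leaving every argparse default as `None` keeps the config file as the single source of defaults, so a value only changes when a flag is actually passed.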

GitHub Token

Create a Personal Access Token with the repo and read:user scopes:

export GITHUB_TOKEN="ghp_..."

Usage

1. Discover Repositories

uv run python get_repos.py                 # Use defaults from config.yaml
uv run python get_repos.py -n 1000         # Limit to 1000 repos
uv run python get_repos.py --dry-run       # Preview query partitions without fetching

Output: repos_YYYY-MM-DD_HHMMSS.jsonl

2. Download AGENTS.md and CLAUDE.md Files

uv run python get_agentsmd.py              # Auto-detect newest repos file
uv run python get_agentsmd.py -w 8         # Use 8 parallel workers (faster)
uv run python get_agentsmd.py -r           # Resume interrupted download
uv run python get_agentsmd.py -r -w 8      # Resume with parallel workers

Output: agents_md_YYYY-MM-DD_HHMMSS/org/repo/AGENTS.md + download_results.jsonl
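
One plausible way to auto-detect the newest repos file and to lay out the download directory is sketched below. The helper names `newest_repos_file` and `output_path` are hypothetical, and the real script may do this differently.

```python
from pathlib import Path


def newest_repos_file(directory="."):
    """Return the most recent repos_*.jsonl file, or None if there is none."""
    # The YYYY-MM-DD_HHMMSS timestamp sorts lexicographically in time order,
    # so the largest file name is also the newest.
    candidates = sorted(Path(directory).glob("repos_*.jsonl"), key=lambda p: p.name)
    return candidates[-1] if candidates else None


def output_path(run_dir, full_name, filename="AGENTS.md"):
    """Map an 'org/repo' name to run_dir/org/repo/filename."""
    org, repo = full_name.split("/", 1)
    return Path(run_dir) / org / repo / filename
```

Sorting by file name rather than modification time means a copied or touched file does not change which run counts as the latest.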

Troubleshooting

| Issue | Solution |
| --- | --- |
| ERROR: set GITHUB_TOKEN | export GITHUB_TOKEN="..." |
| 403 Forbidden | Regenerate token with repo and read:user scopes |
| Rate limit | Scripts auto-wait; run during off-peak hours for large jobs |
| Empty repos.jsonl | Adjust filters in get_repos.py or verify token works |

Verify token:

curl -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/user | jq -r .login
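
The same check can be made from Python with only the standard library. This is a sketch equivalent to the curl command above; the function names are made up for illustration and are not part of the scripts.

```python
import json
import urllib.request


def build_user_request(token):
    """Build the same GET /user request as the curl command above."""
    return urllib.request.Request(
        "https://api.github.com/user",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def fetch_login(token):
    """Return the login of the token's owner; raises HTTPError on a bad token."""
    with urllib.request.urlopen(build_user_request(token), timeout=10) as resp:
        return json.load(resp)["login"]
```

Calling `fetch_login(os.environ["GITHUB_TOKEN"])` should return your GitHub username if the token is valid; an invalid token surfaces as `urllib.error.HTTPError` with status 401.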

Scaling to More Repos

GitHub Search API returns max 1,000 results per query. To get more:

Method 1: Edit star bins in config.yaml to partition queries:

star_bins:
  - [10000, null]
  # - [5000, 9999]  # Uncomment for 5k-10k stars
  # - [2000, 4999]  # Uncomment for 2k-5k stars
  # ... more bins available in config

Method 2: Edit date ranges or other filters in config.yaml

Method 3: Use GitHub on BigQuery for exhaustive queries
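
To show why partitioning gets around the 1,000-result cap, here is a minimal sketch of how star bins could be turned into separate search queries, each with its own cap. The `build_queries` helper is hypothetical and the exact query string used by get_repos.py is an assumption; the `stars:`, `language:`, and `archived:` qualifiers are standard GitHub search syntax.

```python
def build_queries(star_bins, language="Python"):
    """Turn [low, high] star bins into GitHub search query strings."""
    queries = []
    for low, high in star_bins:
        # A null (None) upper bound means "low or more stars".
        stars = f"stars:>={low}" if high is None else f"stars:{low}..{high}"
        queries.append(f"language:{language} archived:false {stars}")
    return queries


for query in build_queries([[10000, None], [5000, 9999], [2000, 4999]]):
    print(query)
# language:Python archived:false stars:>=10000
# language:Python archived:false stars:5000..9999
# language:Python archived:false stars:2000..4999
```

A bin only helps if it matches fewer than 1,000 repos, so bins typically need to be narrower at lower star counts, or combined with date ranges as in Method 2.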

API Limits

| Resource | Limit | Notes |
| --- | --- | --- |
| Search API | 30 req/min | Used by get_repos.py |
| File downloads | N/A | 0.1s delay in get_agentsmd.py |

Both scripts handle rate limits with automatic retry and backoff.
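
As a rough illustration of retry with backoff, the sketch below uses exponential delays. The function names, the base delay, and the cap are assumptions rather than the values in config.yaml, and real code would likely catch only rate-limit or network errors instead of every exception.

```python
import time


def backoff_delays(attempts, base=1.0, cap=60.0):
    """Exponential backoff delays in seconds: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(attempts)]


def call_with_retry(func, attempts=5, sleep=time.sleep):
    """Call func() up to `attempts` times, sleeping between failed tries."""
    delays = backoff_delays(attempts)
    for attempt, delay in enumerate(delays):
        try:
            return func()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == len(delays) - 1:
                raise
            sleep(delay)
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting.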

License

MIT License