This workspace contains a Node.js utility that scrapes hyperscaler documentation (AWS Bedrock, Google Vertex AI, Azure AI Foundry) to keep track of where closed-source frontier models are hosted in Europe and how those locations score on sustainability metrics. The output is a single JSON file that you can feed into reports, dashboards, or downstream scripts.
```bash
npm install
npm run data:sync   # fetch OWID carbon, renewables & nuclear datasets
npm run scrape
```
The default run writes `data/hosting-regions.json`. Pass `--pretty` for formatted JSON and `--provider` to scope the run (for example `npm run scrape -- --provider aws,gcp`). An optional AWS Bedrock Toolkit provider is available (`aws-toolkit`) if you want to validate Anthropic availability through the AWS SDK; see the AWS Bedrock toolkit provider section below for credentials.
- Playwright HTTP client fetches the public docs without launching a browser (so it can be used in a GitHub Action).
- Cheerio parses the documentation tables/lists for the model/region combinations we care about.
- Region metadata enriches each availability entry with carbon intensity, CFE % (if Google discloses it), and tags for nuclear/sustainable regions. The values come from the 2024 Our World in Data datasets (carbon intensity, renewable share, nuclear share) highlighted in the project brief. Each dataset credits the upstream sources that OWID cites (e.g., Ember & Energy Institute 2025 for carbon intensity, Ember & IEA for renewables, and Ember for nuclear share).
- Zod validates the final JSON payload before it gets written so CI can fail early if the upstream markup changes (see the pipeline sketch after this list).
- AWS Bedrock Toolkit (optional) taps into the AWS SDK (`aws-toolkit` provider) to list Anthropic models per EU region and detect whether a row is limited to inference profiles. This is handy when AWS removes the `*` from their docs but the underlying API still exposes inference mode differences.
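Several of the bullets above describe a fetch → parse → validate pipeline. Here is a minimal sketch of what a doc scraper in that style could look like; the URL handling, selectors, and schema fields are illustrative assumptions, not the repository's actual code (the real providers live in `src/lib/providers`).

```js
// Hypothetical provider sketch: fetch docs with Playwright's HTTP client,
// parse a table with Cheerio, validate the result with Zod.
// Selector names and schema fields are assumptions for illustration only.
import { request } from "playwright";
import { load } from "cheerio";
import { z } from "zod";

const RegionEntry = z.object({
  regionKey: z.string(),
  status: z.enum(["available", "unavailable"]),
  notes: z.array(z.string()).default([]),
});

export async function scrapeExampleProvider(url) {
  const http = await request.newContext();          // no browser is launched
  const html = await (await http.get(url)).text();
  await http.dispose();

  const $ = load(html);
  const regions = [];
  $("table tr").each((_, row) => {                  // assumed table layout
    const cells = $(row).find("td");
    if (cells.length < 2) return;
    regions.push({
      regionKey: `aws:${$(cells[0]).text().trim()}`,
      status: "available",
      notes: [$(cells[1]).text().trim()].filter(Boolean),
    });
  });

  // Fail early (e.g. in CI) if the upstream markup changed and parsing broke.
  return z.array(RegionEntry).parse(regions);
}
```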
```json
{
  "generatedAt": "2025-01-12T10:15:00.123Z",
  "providers": [
    {
      "id": "aws",
      "cloud": "aws",
      "service": "bedrock",
      "dataset": "Amazon Bedrock model availability (EU focus)",
      "sourceUrls": ["https://docs.aws.amazon.com/bedrock/latest/userguide/models-regions.html"],
      "models": [
        {
          "vendor": "Anthropic",
          "model": "Claude Sonnet 4.5",
          "slug": "aws-bedrock-anthropic-claude-sonnet-4-5",
          "regions": [
            {
              "regionKey": "aws:eu-central-1",
              "status": "available",
              "notes": ["Available via cross-region inference within the same AWS geography."]
            }
          ]
        }
      ]
    },
    "..."
  ],
  "regionMetadata": {
    "aws:eu-central-1": {
      "displayName": "Europe (Frankfurt)",
      "metrics": {
        "carbonIntensity": 344.1,
        "carbonUnits": "gCO2/kWh",
        "cleanEnergyShare": 0.52
      },
      "tags": ["aws", "eu"]
    }
  }
}
```
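Because the payload is plain JSON, downstream scripts can consume it directly. A small sketch (assuming an ESM context; the 100 gCO2/kWh cut-off is arbitrary):

```js
// Hypothetical downstream script: list regions under a carbon-intensity threshold.
import { readFile } from "node:fs/promises";

const data = JSON.parse(await readFile("data/hosting-regions.json", "utf8"));

const lowCarbon = Object.entries(data.regionMetadata)
  .filter(([, meta]) => meta.metrics.carbonIntensity < 100)   // arbitrary cut-off
  .map(([key, meta]) =>
    `${key} (${meta.displayName}): ${meta.metrics.carbonIntensity} ${meta.metrics.carbonUnits}`);

console.log(lowCarbon.join("\n"));
```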
The script is designed to run headlessly, so you can trigger a periodic GitHub Action to refresh the dataset or deploy the UI. This repo ships with `.github/workflows/deploy-pages.yml`, which:
- Installs dependencies.
- Downloads the OWID sustainability datasets.
- Runs `npm run scrape` (copying the JSON into `docs/data`).
- Publishes the `docs/` directory to GitHub Pages (via the built-in `pages` deployment).
To enable it:
- In GitHub → Settings → Pages, choose GitHub Actions as the build source.
- (Optional) Store `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` as repository secrets if you want the workflow to run the `aws-toolkit` provider; otherwise it will just use the default doc scrapers.
- Push to `main` (or trigger the workflow manually) to build and publish the dashboard. Subsequent pushes re-generate the dataset and redeploy the site automatically.
- The `docs/` directory contains a vanilla HTML/CSS/JS dashboard that fetches `docs/data/hosting-regions.json`, lets you filter/sort/group entries, and highlights sustainability tags.
- After every `npm run scrape`, the dataset is automatically copied into `docs/data` so the UI stays current. You can also run `npm run web:sync` manually if needed.
- To publish the UI on GitHub Pages, either enable the included GitHub Actions workflow (`deploy-pages.yml`, described above) or set the Pages source to the `docs/` folder (repository settings → Pages → Build and deployment → Deploy from a branch → `/docs`). The dashboard will be available at `https://<user>.github.io/<repo>/`.
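For orientation, the kind of vanilla JavaScript the dashboard builds on looks roughly like the sketch below; the grouping logic is illustrative, not a copy of the actual `docs/` code.

```js
// Hypothetical dashboard snippet: fetch the dataset and group models by vendor.
async function loadDataset() {
  const res = await fetch("data/hosting-regions.json");   // relative to docs/
  if (!res.ok) throw new Error(`Failed to load dataset: ${res.status}`);
  const { providers } = await res.json();

  const byVendor = new Map();
  for (const provider of providers) {
    for (const model of provider.models ?? []) {
      const list = byVendor.get(model.vendor) ?? [];
      list.push({ ...model, provider: provider.id });
      byVendor.set(model.vendor, list);
    }
  }
  return byVendor;
}

loadDataset().then((byVendor) => console.log([...byVendor.keys()]));
```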
- Add new providers by creating a scraper in `src/lib/providers` and registering it in `src/scrape.js`. The list of models/vendors you want to watch belongs in the data modules under `src/data/providers/` (for example `gcpModels.js` or `azureOpenAiModels.js`), so most catalog tweaks become data-only pull requests. Optional scraping utilities (like the `aws-toolkit` provider) can live alongside the doc scrapers without being part of the default run. A sketch of such a data module follows this list.
- Update `src/data/regionEntries.js` if you need to add extra regions or sustainability metadata. The enrichment/runtime glue now lives in `src/lib/regionMetadata.js`, so contributors editing region facts no longer need to touch code.
- If a documentation layout changes, the validators run by `npm run scrape` will fail so you can adjust the selectors quickly.
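As a rough illustration of the data-only workflow described in the first bullet, a new catalog module might look like this; the export shape is an assumption, so check the existing files under `src/data/providers/` for the real contract.

```js
// Hypothetical data module, e.g. src/data/providers/exampleModels.js.
// Catalog tweaks like adding a model stay data-only; the scraper reads this list.
export const exampleModels = [
  {
    vendor: "Anthropic",
    model: "Claude Sonnet 4.5",
    slug: "example-anthropic-claude-sonnet-4-5",
  },
];
```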
The dashboard now ingests carbon intensity and renewable electricity share from the official Our World in Data CSV exports:
- Run `npm run data:sync` to download `carbon-intensity-electricity.csv`, `share-electricity-renewables.csv`, and `share-electricity-nuclear.csv` (plus metadata) into `data/owid/`. These files include the official citations from Ember (2025) and Energy Institute (2025) as published on Our World in Data; please retain that attribution in downstream work.
- Run `npm run scrape` (optionally with `--pretty`) to regenerate `data/hosting-regions.json` and `docs/data/hosting-regions.json`. During this step `src/lib/regionMetadata.js` rehydrates every region with the latest carbon intensity, renewable share, nuclear share, and metadata (year + source URLs).
- Commit the refreshed dataset and redeploy the `docs/` site as usual.
The scraper now throws a helpful error if those CSVs are missing, so always run `npm run data:sync` before attempting a fresh scrape.
If the OWID schema changes, adjust the column names defined near the top of `src/lib/regionMetadata.js` (or in `src/lib/owidDatasets.js` if the parsing format changes). The loader expects the same short column labels (`co2_intensity__gco2_kwh`, `renewable_share_of_electricity__pct`, and `nuclear_share_of_electricity__pct`) that the OWID downloader provides when `useColumnShortNames=true`. The metadata JSON files downloaded alongside each CSV are parsed automatically so the exported dataset can surface the exact citation text ("Ember (2025); Energy Institute – Statistical Review of World Energy (2025) – with major processing by Our World in Data").
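A minimal sketch of the kind of guard this implies, assuming the CSV header has no quoted commas (the real loading logic lives in `src/lib/owidDatasets.js` and `src/lib/regionMetadata.js`):

```js
// Hypothetical check: confirm the downloaded OWID CSV exists and still exposes
// the expected short column name before the scraper relies on it.
import { existsSync, readFileSync } from "node:fs";

const CSV_PATH = "data/owid/carbon-intensity-electricity.csv";
if (!existsSync(CSV_PATH)) {
  throw new Error(`Missing ${CSV_PATH}; run "npm run data:sync" first.`);
}

const [header] = readFileSync(CSV_PATH, "utf8").split("\n");
const columns = header.trim().split(",");          // naive split; assumes no quoted headers

if (!columns.includes("co2_intensity__gco2_kwh")) {
  throw new Error(`Expected column "co2_intensity__gco2_kwh" not found; got: ${columns.join(", ")}`);
}
console.log("OWID carbon-intensity CSV has the expected schema.");
```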
The default scrape relies on public documentation, but you can spot-check Anthropic availability via the AWS SDK:
- Ensure you have AWS credentials with `bedrock:ListFoundationModels` permission. You can export them with `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY`, or drop them into a `.env` file (keys are loaded automatically).
- Install the AWS SDK dependency once: `npm install`.
- Run the scraper with `npm run scrape -- --provider aws-toolkit` (or combine it with the defaults: `npm run scrape -- --provider aws,aws-toolkit`).
The resulting dataset appears with id `aws-toolkit` and adds evidence data straight from the API (including whether a model is `ON_DEMAND` or `INFERENCE_PROFILE` per region).
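As a spot-check outside the scraper, the underlying SDK call can be exercised directly. This sketch assumes the `@aws-sdk/client-bedrock` package, default credential resolution, and an ESM context; the actual `aws-toolkit` provider may structure things differently.

```js
// Hypothetical spot-check: list Anthropic models in one EU region and report
// which inference types each model supports (e.g. ON_DEMAND vs INFERENCE_PROFILE).
import { BedrockClient, ListFoundationModelsCommand } from "@aws-sdk/client-bedrock";

const client = new BedrockClient({ region: "eu-central-1" });   // EU region to check

const { modelSummaries = [] } = await client.send(
  new ListFoundationModelsCommand({ byProvider: "anthropic" })
);

for (const model of modelSummaries) {
  console.log(`${model.modelId}: ${(model.inferenceTypesSupported ?? []).join(", ")}`);
}
```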