
LLM Hosting Overview

This workspace contains a Node.js utility that scrapes hyperscaler documentation (AWS Bedrock, Google Vertex AI, Azure AI Foundry) to keep track of where closed-source frontier models are hosted in Europe and how those locations score on sustainability metrics. The output is a single JSON file that you can feed into reports, dashboards, or downstream scripts.

Getting started

npm install
npm run data:sync   # fetch OWID carbon, renewables & nuclear datasets
npm run scrape

The default run writes data/hosting-regions.json. Pass --pretty for formatted JSON and --provider to scope the run (for example npm run scrape -- --provider aws,gcp). An optional AWS Bedrock Toolkit provider is available (aws-toolkit) if you want to validate Anthropic availability through the AWS SDK; see AWS Bedrock toolkit provider below for credentials.

How it works

  • Playwright's HTTP client (the request API) fetches the public docs without launching a browser, so the scraper can run inside a GitHub Action (see the sketch after this list).
  • Cheerio parses the documentation tables/lists for the model/region combinations we care about.
  • Region metadata enriches each availability entry with carbon intensity, CFE % (if Google discloses it), and tags for nuclear/sustainable regions. The values come from the 2024 Our World in Data datasets (carbon intensity, renewable share, nuclear share) highlighted in the project brief. Each dataset credits the upstream sources that OWID cites (e.g., Ember & Energy Institute 2025 for carbon intensity, Ember & IEA for renewables, and Ember for nuclear share).
  • Zod validates the final JSON payload before it gets written so CI can fail early if the upstream markup changes.
  • AWS Bedrock Toolkit (optional) taps into the AWS SDK (aws-toolkit provider) to list Anthropic models per EU region and detect whether a row is limited to inference profiles. This is handy when AWS removes the * from their docs but the underlying API still exposes inference mode differences.
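
The pipeline referenced above boils down to a few lines. A minimal, self-contained sketch, assuming a hypothetical docs URL, table selector, and schema (the real scrapers in src/lib/providers target the actual hyperscaler pages and validate a much richer Zod schema):

// Minimal fetch → parse → validate sketch (URL, selector, and schema are illustrative).
import { request } from 'playwright';
import * as cheerio from 'cheerio';
import { z } from 'zod';

const RegionRow = z.object({ model: z.string().min(1), region: z.string().min(1) });

const http = await request.newContext();   // Playwright's HTTP client, no browser needed
const response = await http.get('https://docs.example.com/model-regions');
const $ = cheerio.load(await response.text());

const rows = $('table tbody tr')
  .map((_, tr) => {
    const cells = $(tr).find('td').map((_, td) => $(td).text().trim()).get();
    return { model: cells[0], region: cells[1] };
  })
  .get();

// Fail early, CI-style, if the markup no longer matches what we expect.
z.array(RegionRow).nonempty().parse(rows);
console.log(rows);
await http.dispose();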

Output structure

{
  "generatedAt": "2025-01-12T10:15:00.123Z",
  "providers": [
    {
      "id": "aws",
      "cloud": "aws",
      "service": "bedrock",
      "dataset": "Amazon Bedrock model availability (EU focus)",
      "sourceUrls": ["https://docs.aws.amazon.com/bedrock/latest/userguide/models-regions.html"],
      "models": [
        {
          "vendor": "Anthropic",
          "model": "Claude Sonnet 4.5",
          "slug": "aws-bedrock-anthropic-claude-sonnet-4-5",
          "regions": [
            {
              "regionKey": "aws:eu-central-1",
              "status": "available",
              "notes": ["Available via cross-region inference within the same AWS geography."]
            }
          ]
        }
      ]
    },
    "..."
  ],
  "regionMetadata": {
    "aws:eu-central-1": {
      "displayName": "Europe (Frankfurt)",
      "metrics": {
        "carbonIntensity": 344.1,
        "carbonUnits": "gCO2/kWh",
        "cleanEnergyShare": 0.52
      },
      "tags": ["aws", "eu"]
    }
  }
}
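
Because the output is plain JSON, downstream scripts can consume it directly. An illustrative Node example (field names as shown above; the 200 gCO2/kWh threshold is arbitrary):

// List available models hosted in regions below an arbitrary carbon-intensity threshold.
import { readFile } from 'node:fs/promises';

const data = JSON.parse(await readFile('data/hosting-regions.json', 'utf8'));

for (const provider of data.providers) {
  for (const model of provider.models ?? []) {
    for (const region of model.regions) {
      const meta = data.regionMetadata[region.regionKey];
      const intensity = meta?.metrics?.carbonIntensity;
      if (region.status === 'available' && typeof intensity === 'number' && intensity < 200) {
        console.log(`${model.vendor} ${model.model} @ ${meta.displayName}: ${intensity} ${meta.metrics.carbonUnits}`);
      }
    }
  }
}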

Automating updates

The script is designed to run headlessly, so you can trigger a periodic GitHub Action to refresh the dataset or deploy the UI. This repo ships with .github/workflows/deploy-pages.yml, which:

  1. Installs dependencies.
  2. Downloads the OWID sustainability datasets.
  3. Runs npm run scrape (copying the JSON into docs/data).
  4. Publishes the docs/ directory to GitHub Pages (via the built-in pages deployment).

To enable it:

  1. In GitHub → Settings → Pages, choose GitHub Actions as the build source.
  2. (Optional) Store AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY as repository secrets if you want the workflow to run the aws-toolkit provider; otherwise it will just use the default doc scrapers.
  3. Push to main (or trigger the workflow manually) to build and publish the dashboard. Subsequent pushes re-generate the dataset and redeploy the site automatically.

Lightweight UI (GitHub Pages ready)

  • The docs/ directory contains a vanilla HTML/CSS/JS dashboard that fetches docs/data/hosting-regions.json, lets you filter/sort/group entries, and highlights sustainability tags (a loading sketch follows this list).
  • After every npm run scrape, the dataset is automatically copied into docs/data so the UI stays current. You can also run npm run web:sync manually if needed.
  • To publish the UI on GitHub Pages, either enable the included GitHub Actions workflow (deploy-pages.yml, described above) or set the Pages source to the docs/ folder (repository settings → Pages → Build and deployment → Deploy from a branch → /docs). The dashboard will be available at https://<user>.github.io/<repo>/.
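
For reference, the data-loading side of such a page can be sketched in a few lines (assumes a <script type="module">; the real dashboard in docs/ adds the filtering, sorting, and grouping controls):

// Sketch of loading and grouping the dataset in the browser (illustrative only).
const response = await fetch('data/hosting-regions.json');
const { generatedAt, providers, regionMetadata } = await response.json();

// Group availability rows by region so the UI can filter/sort/group them client-side.
const byRegion = new Map();
for (const provider of providers) {
  for (const model of provider.models ?? []) {
    for (const { regionKey, status } of model.regions) {
      if (!byRegion.has(regionKey)) byRegion.set(regionKey, []);
      byRegion.get(regionKey).push({ provider: provider.id, model: model.model, status });
    }
  }
}
console.log(generatedAt, byRegion, regionMetadata);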

Extending

  • Add new providers by creating a scraper in src/lib/providers and registering it in src/scrape.js (a hypothetical skeleton is sketched after this list). The list of models/vendors you want to watch belongs in the data modules under src/data/providers/ (for example gcpModels.js or azureOpenAiModels.js), so most catalog tweaks become data-only pull requests. Optional scraping utilities (like the aws-toolkit provider) can live alongside the doc scrapers without being part of the default run.
  • Update src/data/regionEntries.js if you need to add extra regions or sustainability metadata. The enrichment/runtime glue now lives in src/lib/regionMetadata.js, so contributors editing region facts no longer need to touch code.
  • If a documentation layout changes, the validators in npm run scrape will fail so you can adjust the selectors quickly.
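
Purely as an illustration of that layout (every name below is hypothetical; mirror an existing module in src/lib/providers for the real interface and registration), a new doc scraper might look roughly like this:

// Hypothetical provider skeleton for src/lib/providers/exampleCloud.js (names are illustrative).
import * as cheerio from 'cheerio';
import { exampleModels } from '../../data/providers/exampleModels.js'; // hypothetical data module

export const exampleCloudProvider = {
  id: 'example-cloud',
  sourceUrls: ['https://docs.example.com/model-regions'],
  async scrape({ fetchHtml }) { // assumes an injected HTML fetcher shared by the doc scrapers
    const $ = cheerio.load(await fetchHtml(this.sourceUrls[0]));
    const models = exampleModels.map((entry) => ({
      ...entry,
      regions: $(`tr:contains("${entry.model}") td.region`) // hypothetical selector
        .map((_, td) => ({ regionKey: `example:${$(td).text().trim()}`, status: 'available' }))
        .get(),
    }));
    return {
      id: this.id,
      cloud: 'example',
      service: 'example-ai',
      dataset: 'Example model availability (EU focus)',
      sourceUrls: this.sourceUrls,
      models,
    };
  },
};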

Energy dataset refresh

The dashboard now ingests carbon intensity, renewable electricity share, and nuclear electricity share from the official Our World in Data CSV exports:

  1. Run npm run data:sync to download carbon-intensity-electricity.csv, share-electricity-renewables.csv, and share-electricity-nuclear.csv (plus metadata) into data/owid/. These files include the official citations from Ember (2025) and Energy Institute (2025) as published on Our World in Data—please retain that attribution in downstream work.
  2. Run npm run scrape (optionally with --pretty) to regenerate data/hosting-regions.json and docs/data/hosting-regions.json. During this step src/lib/regionMetadata.js rehydrates every region with the latest carbon intensity, renewable share, nuclear share, and metadata (year + source URLs).
  3. Commit the refreshed dataset and redeploy the docs/ site as usual.

The scraper now throws a helpful error if those CSVs are missing, so always run npm run data:sync before attempting a fresh scrape.

If the OWID schema changes, adjust the column names defined near the top of src/lib/regionMetadata.js (or in src/lib/owidDatasets.js if the parsing format changes). The loader expects the same column short names (co2_intensity__gco2_kwh, renewable_share_of_electricity__pct, and nuclear_share_of_electricity__pct) that the OWID downloader provides when useColumnShortNames=true. The metadata JSON files downloaded alongside each CSV are parsed automatically so the exported dataset can surface the exact citation text (“Ember (2025); Energy Institute – Statistical Review of World Energy (2025) – with major processing by Our World in Data”).
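
As a reference for how those column short names line up with the CSV layout, a naive lookup might look like this (plain comma splitting for brevity; the actual loader in src/lib/owidDatasets.js may be stricter about quoting and also reads the metadata JSON):

// Naive lookup of Germany's 2024 carbon intensity from the downloaded OWID CSV (illustrative only).
import { readFile } from 'node:fs/promises';

const csv = await readFile('data/owid/carbon-intensity-electricity.csv', 'utf8');
const [headerLine, ...lines] = csv.trim().split('\n');
const header = headerLine.split(',').map((name) => name.toLowerCase());
const entityIdx = header.indexOf('entity');
const yearIdx = header.indexOf('year');
const valueIdx = header.indexOf('co2_intensity__gco2_kwh'); // column short name, see above

for (const line of lines) {
  const cells = line.split(','); // assumes no quoted commas in these exports
  if (cells[entityIdx] === 'Germany' && cells[yearIdx] === '2024') {
    console.log(Number(cells[valueIdx]), 'gCO2/kWh'); // roughly the figure behind aws:eu-central-1 above
  }
}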

AWS Bedrock toolkit provider (optional)

The default scrape relies on public documentation, but you can spot-check Anthropic availability via the AWS SDK:

  1. Ensure you have AWS credentials with bedrock:ListFoundationModels permission. You can export them with AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, or drop them into a .env file (keys are loaded automatically).
  2. Install the AWS SDK dependency once: npm install.
  3. Run the scraper with npm run scrape -- --provider aws-toolkit (or combine it with the defaults: npm run scrape -- --provider aws,aws-toolkit).

The resulting dataset appears with id aws-toolkit and adds evidence data straight from the API (including whether a model is ON_DEMAND or INFERENCE_PROFILE per region).
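
Under the hood that check amounts to calling ListFoundationModels in each EU region. A standalone sketch with the AWS SDK for JavaScript v3 (region list abbreviated; not the repo's exact implementation):

// List Anthropic models per EU region and print their supported inference types.
import { BedrockClient, ListFoundationModelsCommand } from '@aws-sdk/client-bedrock';

const euRegions = ['eu-central-1', 'eu-west-1', 'eu-west-3']; // abbreviated example list
for (const region of euRegions) {
  const client = new BedrockClient({ region }); // picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
  const { modelSummaries = [] } = await client.send(
    new ListFoundationModelsCommand({ byProvider: 'Anthropic' })
  );
  for (const summary of modelSummaries) {
    console.log(region, summary.modelId, (summary.inferenceTypesSupported ?? []).join(','));
  }
}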
