
feat: add EIA data preprocessing pipeline for US solar generation #133

Open
mahendra-918 wants to merge 1 commit into openclimatefix:main from mahendra-918:feature/us-eia-preprocessing

Conversation

@mahendra-918
Contributor

Description

Adds preprocessing for US solar data from EIA to work with ocf-data-sampler. This is Phase 2 of #103, building on the data fetching from PR #127.

The script transforms EIA data into the schema ocf-data-sampler expects (time_utc and location_id dimensions) and estimates capacity as the 99th percentile of historical generation. It supports all major US regions and defaults to US48 to avoid duplicate data.
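For reviewers, here is a minimal sketch of the two steps described above: renaming to the time_utc / location_id schema and estimating capacity as the 99th percentile of generation per location. It is an illustration only, not the code in preprocess_eia_data.py. The input column names (`period`, `respondent`, `value`), the `generation_mw` column, and both function names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical long-format input. Column names are assumptions for this
# sketch, not necessarily what the fetching code from PR #127 returns.
raw = pd.DataFrame(
    {
        "period": [
            "2024-01-01T00:00",
            "2024-01-01T01:00",
            "2024-01-01T02:00",
            "2024-01-01T03:00",
        ] * 2,
        "respondent": ["CAISO"] * 4 + ["US48"] * 4,
        "value": [0.0, 100.0, 200.0, 300.0, 0.0, 1000.0, 2000.0, 3000.0],
    }
)


def to_sampler_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns to time_utc / location_id and parse timestamps as UTC."""
    out = pd.DataFrame(
        {
            "time_utc": pd.to_datetime(df["period"], utc=True),
            "location_id": df["respondent"],
            "generation_mw": df["value"].astype(float),
        }
    )
    return out.sort_values(["location_id", "time_utc"]).reset_index(drop=True)


def estimate_capacity(df: pd.DataFrame, q: float = 0.99) -> pd.Series:
    """Estimate capacity per location as the q-th quantile of generation."""
    return df.groupby("location_id")["generation_mw"].quantile(q)


tidy = to_sampler_schema(raw)
capacity_mw = estimate_capacity(tidy)

# One row per timestamp, one column per location.
grid = tidy.pivot(index="time_utc", columns="location_id", values="generation_mw")
```

Using a high percentile rather than the maximum keeps a single spurious spike in the generation data from inflating the capacity estimate.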

New files:

  • preprocess_eia_data.py - preprocessing pipeline
  • test_eia_preprocessing.py - tests (21 new tests)

Updated:

  • training_model_new_country.md - added US usage examples

Fixes #103 (Phase 2)

How Has This Been Tested?

Added 21 tests covering schema transformation, capacity estimation, validation, and the full pipeline. The full suite passes (31/31).

Ran manually with:
python -m open_data_pvnet.scripts.preprocess_eia_data \
  --start-date 2024-01-01 \
  --end-date 2024-01-07 \
  --regions CAISO \
  --output ./test.zarr

  • Yes
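For context on the flags used in the manual run, the following is a rough sketch of how such a command line could be parsed with `argparse`. It is not the script's actual parser. The flag names and the US48 default come from this PR; everything else is an assumption, including `build_parser`, the space-separated `--regions` syntax, and the default output path.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the manual run."""
    parser = argparse.ArgumentParser(
        description="Preprocess EIA solar generation data (illustrative sketch)."
    )
    parser.add_argument("--start-date", type=date.fromisoformat, required=True)
    parser.add_argument("--end-date", type=date.fromisoformat, required=True)
    # Defaulting to US48 alone avoids pulling the same generation twice.
    # Accepting several space-separated regions is an assumption.
    parser.add_argument("--regions", nargs="+", default=["US48"])
    # The default output path is an assumption for this sketch.
    parser.add_argument("--output", default="./eia_solar.zarr")
    return parser


# Same arguments as the manual run.
args = build_parser().parse_args(
    [
        "--start-date", "2024-01-01",
        "--end-date", "2024-01-07",
        "--regions", "CAISO",
        "--output", "./test.zarr",
    ]
)

# Omitting --regions falls back to the US48 default.
default_args = build_parser().parse_args(
    ["--start-date", "2024-01-01", "--end-date", "2024-01-07"]
)
```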

If your changes affect data processing, have you plotted any changes? i.e. have you done a quick sanity check?

  • Yes - checked output schema and capacity values look reasonable

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@mahendra-918
Contributor Author

@peterdudfield @siddharth7113 Hi! This PR implements Phase 2 of #103 - preprocessing EIA data for US solar generation.

I've addressed all the feedback from PR #127:

  • Defaults to hourly data
  • Uses US48 region to avoid duplicates
  • Outputs to Zarr format
  • Leverages the automatic pagination from the previous PR

Added 21 comprehensive tests (all passing). Would appreciate your review when you have a chance!

Let me know if anything needs changes. Thanks!

Development

Successfully merging this pull request may close these issues.

[META] Extend PVNet solar generation model to the United States
