Update proteomics example data with collection sites and update all tutorials accordingly#62

Merged
enryH merged 12 commits into main from update_batch_corr_example
Mar 3, 2026
Conversation

@enryH enryH commented Mar 3, 2026

Summary

Add collection sites to proteomics dataset.

List of changes proposed in this PR (pull-request)

  • update proteomics creation script (includes preprocessing)
  • update all tutorials to use the new curated dataset

Checks

@enryH enryH marked this pull request as ready for review March 3, 2026 17:33
@enryH enryH requested a review from Copilot March 3, 2026 17:37
@enryH enryH changed the title Update batch corr example Update proteomics example data with collection sites and update all tutorials accordingly Mar 3, 2026
@enryH enryH merged commit 32e389a into main Mar 3, 2026
13 of 14 checks passed
@enryH enryH deleted the update_batch_corr_example branch March 3, 2026 17:42
Copilot AI left a comment

Pull request overview

Updates the Alzheimer proteomics example dataset to include a collection-site field and aligns documentation/tutorial notebooks to use the newly curated combined dataset for downstream examples.

Changes:

  • Pin pingouin to <0.6.0 due to upstream breaking column-name changes.
  • Extend Alzheimer proteomics example-data generation to add collection_site from metadata and persist it in the combined dataset.
  • Update API example notebooks/scripts to load the curated combined dataset from this repo and account for the new collection_site column.
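
The last two changes can be illustrated with a minimal pandas sketch. The toy tables, the protein column names, the `age` column, and the site labels are hypothetical; only the `collection_site` column name comes from this PR, and the repository's actual script may build the dataset differently.

```python
import pandas as pd

# Hypothetical toy data standing in for the proteomics matrix and the clinical metadata.
samples = pd.Index(["s1", "s2", "s3"], name="sample_id")
omics = pd.DataFrame(
    {"protein_a": [10.1, 11.3, 9.8], "protein_b": [7.2, 6.9, 7.5]},
    index=samples,
)
meta = pd.DataFrame(
    {"collection_site": ["site_1", "site_2", "site_1"], "age": [71, 68, 75]},
    index=samples,
)

# Add the site label from the metadata to the combined dataset (aligned on the index).
combined = omics.join(meta["collection_site"])

# Examples that need numeric input only can drop the new column again.
numeric_only = combined.drop(columns="collection_site")
```

Joining on the sample index keeps the site label aligned with each sample even if the two tables are ordered differently.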

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file
| File | Description |
| --- | --- |
| pyproject.toml | Pins pingouin to avoid breaking changes. |
| docs/example_data/alzheimer_proteomics.py | Adds metadata load and creates collection_site column in the combined dataset. |
| docs/example_data/alzheimer_proteomics.ipynb | Notebook equivalent of the example-data update (adds collection_site). |
| docs/api_examples/normalization_analysis.py | Switches example to curated combined dataset and updates plotting defaults. |
| docs/api_examples/normalization_analysis.ipynb | Notebook equivalent of the normalization example update. |
| docs/api_examples/diff_regulation_anova_ttest_two_groups.py | Points to curated dataset and drops collection_site for numeric-only analysis. |
| docs/api_examples/diff_regulation_anova_ttest_two_groups.ipynb | Notebook equivalent of the two-group diff regulation update. |
| docs/api_examples/diff_regulation_ancova.py | Points to curated dataset and drops collection_site for numeric-only analysis. |
| docs/api_examples/diff_regulation_ancova.ipynb | Notebook equivalent of the ANCOVA example update. |
| docs/api_examples/batch_correction.py | Updates batch-correction example to use collection_site as the batch variable and the curated dataset. |
| docs/api_examples/batch_correction.ipynb | Notebook equivalent of the batch-correction example update. |
| .github/workflows/cicd.yml | Minor formatting + sets matrix fail-fast: false. |

Comment on lines +166 to +167
PCs, fig = run_and_plot_pca(standard_normalize(X), y, n_components=4)
ax = plot_umap(X, y)

Copilot AI Mar 3, 2026

After batch correction, PCA is run on standard_normalize(X) but UMAP is plotted from raw X. Since run_umap expects scaled input, the UMAP plot should likely use the same standardized data to match the PCA view (or the variable/function naming should be adjusted if unscaled is intended).

Suggested change
- PCs, fig = run_and_plot_pca(standard_normalize(X), y, n_components=4)
- ax = plot_umap(X, y)
+ X_scaled = standard_normalize(X)
+ PCs, fig = run_and_plot_pca(X_scaled, y, n_components=4)
+ ax = plot_umap(X_scaled, y)

Copilot uses AI. Check for mistakes.
"https://raw.githubusercontent.com/RasmussenLab/njab/"
"HEAD/docs/tutorial/data/alzheimer/"
)
META: str = "meta.csv" # clincial data

Copilot AI Mar 3, 2026

Typo in comment: “clincial” should be “clinical”.

Suggested change
- META: str = "meta.csv" # clincial data
+ META: str = "meta.csv" # clinical data

" \"https://raw.githubusercontent.com/RasmussenLab/njab/\"\n",
" \"HEAD/docs/tutorial/data/alzheimer/\"\n",
")\n",
"META: str = \"meta.csv\" # clincial data\n",

Copilot AI Mar 3, 2026

Typo in comment string: “clincial” should be “clinical”.

Suggested change
- "META: str = \"meta.csv\" # clincial data\n",
+ "META: str = \"meta.csv\" # clinical data\n",

  # jupyter:
  #   jupytext:
- #     cell_metadata_filter: tags,-all
+ #     cell_data_filter: tags,-all

Copilot AI Mar 3, 2026

The Jupytext header uses cell_data_filter, but the rest of the repo’s Jupytext headers use cell_metadata_filter. If this key is not recognized, tags like hide-input/hide-output may not round-trip correctly when syncing notebooks. Use the same cell_metadata_filter key here for consistency and compatibility.

Suggested change
- # cell_data_filter: tags,-all
+ # cell_metadata_filter: tags,-all

  "metadata": {
    "jupytext": {
-     "cell_metadata_filter": "tags,-all"
+     "cell_data_filter": "tags,-all"

Copilot AI Mar 3, 2026

Notebook metadata sets jupytext.cell_data_filter, but other notebooks in this repo use jupytext.cell_metadata_filter. If cell_data_filter isn’t supported by your Jupytext tooling, tag filtering may be ignored when converting/syncing. Consider switching back to cell_metadata_filter for consistency with the rest of the docs.

Suggested change
- "cell_data_filter": "tags,-all"
+ "cell_metadata_filter": "tags,-all"
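
One way to catch this kind of key mix-up before it is committed is a small check over the notebook metadata. The sketch below is hypothetical and not part of the repository: the helper name `unexpected_filter_keys` is made up, and the set of accepted keys is an assumption limited to the two filter options documented by Jupytext.

```python
import json

# Filter keys assumed to be valid here; anything else ending in "_filter" is flagged.
KNOWN_FILTER_KEYS = {"cell_metadata_filter", "notebook_metadata_filter"}


def unexpected_filter_keys(notebook_json: str) -> list[str]:
    """Return jupytext metadata keys ending in '_filter' that are not in the known set."""
    metadata = json.loads(notebook_json).get("metadata", {})
    jupytext = metadata.get("jupytext", {})
    return sorted(
        key for key in jupytext
        if key.endswith("_filter") and key not in KNOWN_FILTER_KEYS
    )


# Minimal notebook stub reproducing the key flagged in this review comment.
nb = '{"metadata": {"jupytext": {"cell_data_filter": "tags,-all"}}, "cells": []}'
print(unexpected_filter_keys(nb))  # ['cell_data_filter']
```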

Comment on lines 154 to 159
  # %%time
  X = median_impute(omics)
  X = acore.batch_correction.combat_batch_correction(
-     X.join(y),
-     batch_col="site",
+     X.join(y.astype("category")),
+     batch_col=y.name,
  )

Copilot AI Mar 3, 2026

batch_col=y.name will break if group_label is set to None (since Series.rename(None) yields y.name is None). Given the parameter is typed Optional and documented as “optional rename”, consider ensuring y always has a non-empty name (e.g., fall back to group) and/or pass batch_col from the known group/group_label variable instead of y.name.
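
A minimal sketch of the suggested fallback, assuming plain pandas only: the helper `with_batch_name` and the toy series are hypothetical, and the real example would pass whatever group or group_label variable the notebook already defines.

```python
import pandas as pd


def with_batch_name(y: pd.Series, fallback: str) -> pd.Series:
    """Return y unchanged if it has a usable name, otherwise renamed to the fallback."""
    if y.name is None or y.name == "":
        return y.rename(fallback)
    return y


y = pd.Series(["site_1", "site_2"], name="collection_site")

# Renaming with None clears the name, which mimics group_label=None.
unnamed = y.rename(None)
assert unnamed.name is None

# With the fallback, a valid column name is always available for batch_col.
safe = with_batch_name(unnamed, fallback="collection_site")
batch_col = safe.name
```

Passing the known column name directly as batch_col would avoid relying on y.name at all, which is the other option named in the comment.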

Comment on lines 141 to +144
  omics_imp = median_impute(omics)
  omics_imp_scaled = standard_normalize(omics_imp)
- PCs, fig = run_and_plot_pca(omics_imp, y, METACOL_LABEL, n_components=4)
- ax = plot_umap(omics_imp, y, METACOL_LABEL)
+ PCs, fig = run_and_plot_pca(omics_imp, y, n_components=4)
+ ax = plot_umap(omics_imp, y)

Copilot AI Mar 3, 2026

omics_imp_scaled is computed but not used; run_and_plot_pca/plot_umap are instead called with the unscaled omics_imp. Since acore.decomposition.umap.run_umap explicitly expects scaled input (X_scaled), this likely produces inconsistent plots. Either use omics_imp_scaled in the PCA/UMAP calls or drop the unused scaling step to avoid confusion.
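
To illustrate why scaled and unscaled inputs can give different pictures, here is a small NumPy sketch on synthetic data. It assumes that standardizing means per-feature z-scoring, which may differ from what standard_normalize does in acore, and it uses a plain covariance eigendecomposition rather than the library's PCA helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical, independent features on very different scales.
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 100, 200)])


def zscore(X: np.ndarray) -> np.ndarray:
    """Center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def first_pc_variance_share(X: np.ndarray) -> float:
    """Fraction of total variance captured by the first principal component."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))  # ascending order
    return float(eigvals[-1] / eigvals.sum())


raw_share = first_pc_variance_share(X)             # dominated by the large-scale feature
scaled_share = first_pc_variance_share(zscore(X))  # both features contribute
```

On unscaled data the first component mostly tracks the feature with the largest variance, so plots built from omics_imp and omics_imp_scaled are not directly comparable.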

2 participants