Update proteomics example data with collection sites and update all tutorials accordingly #62
- needs to be set to main before merge this time.
- update how group renaming is used
Pull request overview
Updates the Alzheimer proteomics example dataset to include a collection-site field and aligns documentation/tutorial notebooks to use the newly curated combined dataset for downstream examples.
Changes:
- Pin `pingouin` to `<0.6.0` due to upstream breaking column-name changes.
- Extend Alzheimer proteomics example-data generation to add `collection_site` from metadata and persist it in the combined dataset.
- Update API example notebooks/scripts to load the curated combined dataset from this repo and account for the new `collection_site` column.
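A minimal sketch of how a downstream example might account for the new column, assuming the combined dataset loads as a pandas DataFrame with numeric protein columns plus a `collection_site` column. The protein names, sample IDs, and site labels below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the combined dataset (all values here are hypothetical).
combined = pd.DataFrame(
    {
        "P1": [1.0, 2.0, 3.0],
        "P2": [4.0, 5.0, 6.0],
        "collection_site": ["site_a", "site_b", "site_a"],
    },
    index=["s1", "s2", "s3"],
)

# Keep the site labels, e.g. as the batch variable for batch correction ...
site = combined["collection_site"].astype("category")

# ... and drop the column wherever only numeric features are expected.
omics = combined.drop(columns="collection_site")
```

Separating the label column up front keeps the numeric-only analyses and the batch-correction example working from the same loaded table.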
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `pyproject.toml` | Pins pingouin to avoid breaking changes. |
| `docs/example_data/alzheimer_proteomics.py` | Adds metadata load and creates `collection_site` column in the combined dataset. |
| `docs/example_data/alzheimer_proteomics.ipynb` | Notebook equivalent of the example-data update (adds `collection_site`). |
| `docs/api_examples/normalization_analysis.py` | Switches example to curated combined dataset and updates plotting defaults. |
| `docs/api_examples/normalization_analysis.ipynb` | Notebook equivalent of the normalization example update. |
| `docs/api_examples/diff_regulation_anova_ttest_two_groups.py` | Points to curated dataset and drops `collection_site` for numeric-only analysis. |
| `docs/api_examples/diff_regulation_anova_ttest_two_groups.ipynb` | Notebook equivalent of the two-group diff regulation update. |
| `docs/api_examples/diff_regulation_ancova.py` | Points to curated dataset and drops `collection_site` for numeric-only analysis. |
| `docs/api_examples/diff_regulation_ancova.ipynb` | Notebook equivalent of the ANCOVA example update. |
| `docs/api_examples/batch_correction.py` | Updates batch-correction example to use `collection_site` as the batch variable and the curated dataset. |
| `docs/api_examples/batch_correction.ipynb` | Notebook equivalent of the batch-correction example update. |
| `.github/workflows/cicd.yml` | Minor formatting + sets matrix `fail-fast: false`. |
    PCs, fig = run_and_plot_pca(standard_normalize(X), y, n_components=4)
    ax = plot_umap(X, y)
After batch correction, PCA is run on standard_normalize(X) but UMAP is plotted from raw X. Since run_umap expects scaled input, the UMAP plot should likely use the same standardized data to match the PCA view (or the variable/function naming should be adjusted if unscaled is intended).
Suggested change:

    - PCs, fig = run_and_plot_pca(standard_normalize(X), y, n_components=4)
    - ax = plot_umap(X, y)
    + X_scaled = standard_normalize(X)
    + PCs, fig = run_and_plot_pca(X_scaled, y, n_components=4)
    + ax = plot_umap(X_scaled, y)
    "https://raw.githubusercontent.com/RasmussenLab/njab/"
    "HEAD/docs/tutorial/data/alzheimer/"
    )
    META: str = "meta.csv" # clincial data
Typo in comment: “clincial” should be “clinical”.
Suggested change:

    - META: str = "meta.csv" # clincial data
    + META: str = "meta.csv" # clinical data
    " \"https://raw.githubusercontent.com/RasmussenLab/njab/\"\n",
    " \"HEAD/docs/tutorial/data/alzheimer/\"\n",
    ")\n",
    "META: str = \"meta.csv\" # clincial data\n",
Typo in comment string: “clincial” should be “clinical”.
Suggested change:

    - "META: str = \"meta.csv\" # clincial data\n",
    + "META: str = \"meta.csv\" # clinical data\n",
      # jupyter:
      #   jupytext:
    - #     cell_metadata_filter: tags,-all
    + #     cell_data_filter: tags,-all
The Jupytext header uses cell_data_filter, but the rest of the repo’s Jupytext headers use cell_metadata_filter. If this key is not recognized, tags like hide-input/hide-output may not round-trip correctly when syncing notebooks. Use the same cell_metadata_filter key here for consistency and compatibility.
Suggested change:

    - #     cell_data_filter: tags,-all
    + #     cell_metadata_filter: tags,-all
      "metadata": {
        "jupytext": {
    -     "cell_metadata_filter": "tags,-all"
    +     "cell_data_filter": "tags,-all"
Notebook metadata sets jupytext.cell_data_filter, but other notebooks in this repo use jupytext.cell_metadata_filter. If cell_data_filter isn’t supported by your Jupytext tooling, tag filtering may be ignored when converting/syncing. Consider switching back to cell_metadata_filter for consistency with the rest of the docs.
Suggested change:

    - "cell_data_filter": "tags,-all"
    + "cell_metadata_filter": "tags,-all"
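One way to catch this kind of key drift across notebooks is a small metadata check. The sketch below is only an illustration: the helper name `jupytext_filter_issues` and the set of accepted filter keys are assumptions made here, not part of this repository or of Jupytext's own tooling.

```python
import json
from typing import List

# Keys this sketch treats as valid filter options (an assumption for the example).
ACCEPTED_FILTER_KEYS = {"cell_metadata_filter", "notebook_metadata_filter"}


def jupytext_filter_issues(notebook_json: str) -> List[str]:
    """Return readable issues found in a notebook's jupytext metadata."""
    metadata = json.loads(notebook_json).get("metadata", {})
    jupytext = metadata.get("jupytext", {})
    issues = []
    if "cell_metadata_filter" not in jupytext:
        issues.append("missing 'cell_metadata_filter'")
    for key in jupytext:
        # Flag any '*_filter' key that is not in the accepted set.
        if key.endswith("_filter") and key not in ACCEPTED_FILTER_KEYS:
            issues.append(f"unexpected filter key '{key}'")
    return issues


# Metadata shaped like the snippet under review.
example = '{"metadata": {"jupytext": {"cell_data_filter": "tags,-all"}}}'
issues = jupytext_filter_issues(example)
print(issues)
```

Running such a check over every `.ipynb` file under `docs/` would flag notebooks whose header deviates from the rest of the repo before they are synced.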
      # %%time
      X = median_impute(omics)
      X = acore.batch_correction.combat_batch_correction(
    -     X.join(y),
    -     batch_col="site",
    +     X.join(y.astype("category")),
    +     batch_col=y.name,
      )
batch_col=y.name will break if group_label is set to None (since Series.rename(None) yields y.name is None). Given the parameter is typed Optional and documented as “optional rename”, consider ensuring y always has a non-empty name (e.g., fall back to group) and/or pass batch_col from the known group/group_label variable instead of y.name.
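A minimal sketch of the fallback described above, assuming the metadata is a pandas DataFrame and that `group` names the column holding the batch labels. The helper `named_batch_labels` and the toy metadata are hypothetical and only illustrate the idea of never letting the Series name become `None`.

```python
from typing import Optional

import pandas as pd


def named_batch_labels(
    meta: pd.DataFrame, group: str, group_label: Optional[str] = None
) -> pd.Series:
    """Return the batch column as a categorical Series that always has a name."""
    # Fall back to the original column name when no rename is requested.
    name = group_label if group_label else group
    return meta[group].rename(name).astype("category")


# Toy metadata (hypothetical values).
meta = pd.DataFrame({"collection_site": ["a", "b", "a"]}, index=["s1", "s2", "s3"])

y = named_batch_labels(meta, "collection_site", group_label=None)
batch_col = y.name  # safe to pass on as the batch column name
```

With this shape, `batch_col` can be taken from `y.name` or from the known `group`/`group_label` variables interchangeably, since both always agree.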
      omics_imp = median_impute(omics)
      omics_imp_scaled = standard_normalize(omics_imp)
    - PCs, fig = run_and_plot_pca(omics_imp, y, METACOL_LABEL, n_components=4)
    - ax = plot_umap(omics_imp, y, METACOL_LABEL)
    + PCs, fig = run_and_plot_pca(omics_imp, y, n_components=4)
    + ax = plot_umap(omics_imp, y)
omics_imp_scaled is computed but not used; run_and_plot_pca/plot_umap are instead called with the unscaled omics_imp. Since acore.decomposition.umap.run_umap explicitly expects scaled input (X_scaled), this likely produces inconsistent plots. Either use omics_imp_scaled in the PCA/UMAP calls or drop the unused scaling step to avoid confusion.
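The point can be illustrated with a small sketch that scales once and reuses the result. The `zscore` helper below is a plain pandas stand-in written for this example; it is not the library's `standard_normalize`, whose exact behavior is not shown in this PR, and the toy values are made up.

```python
import pandas as pd


def zscore(df: pd.DataFrame) -> pd.DataFrame:
    """Column-wise z-score (stand-in for the library's scaling step)."""
    return (df - df.mean()) / df.std(ddof=0)


# Toy imputed data (hypothetical values).
omics_imp = pd.DataFrame({"P1": [1.0, 2.0, 3.0], "P2": [10.0, 20.0, 60.0]})

# Scale once ...
omics_imp_scaled = zscore(omics_imp)

# ... and hand the same scaled frame to every embedding call, e.g.
# run_and_plot_pca(omics_imp_scaled, y, n_components=4)
# plot_umap(omics_imp_scaled, y)
```

Binding the scaled data to one variable makes it obvious when a call accidentally receives the unscaled frame instead.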
Summary
Add collection sites to proteomics dataset.
List of changes proposed in this PR (pull-request)
Checks