Store and Access HVG Gene Names in AnnData #246
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -12,3 +12,4 @@ notebooks/ | |
| *.slurm | ||
| temp | ||
| wandb/ | ||
| tasks/ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,33 @@ | ||
| # Repository Guidelines | ||
|
|
||
| ## Project Structure & Module Organization | ||
| - Core library lives in `src/state`, with CLI entrypoints in `state/__main__.py` and subcommands under `state/_cli`. | ||
| - Model configs and resources sit in `src/state/configs`, embeddings helpers in `src/state/emb`, and transition utilities in `src/state/tx`. | ||
| - Tests reside in `tests/` (`test_*.py`), runnable without extra fixtures. Example TOML configs are in `examples/`. Helper scripts live in `scripts/` for inference and embedding. | ||
| - Artifacts or scratch outputs should go in `tmp/` or a user-created path; keep `assets/` for checked-in visuals/resources only. | ||
|
|
||
| ## Build, Test, and Development Commands | ||
| - Create/activate env and install in editable mode: `uv tool install -e .`. | ||
| - Run the CLI: `uv run state --help` (entrypoints `emb` and `tx`). | ||
| - Format/lint: `uv run ruff check .` (auto-fixes enabled by default config). | ||
| - Run tests: `uv run pytest` (adds `src/` to `PYTHONPATH` via standard layout). | ||
|
|
||
| ## Coding Style & Naming Conventions | ||
| - Python 3.10–3.12; prefer type hints on public functions. | ||
| - Use 4-space indentation, 120-char max line length (`ruff.toml`), and avoid bare `except` (E722 is explicitly ignored—only use when necessary). | ||
| - Modules and files use `snake_case`; classes `CamelCase`; constants `UPPER_SNAKE_CASE`. | ||
| - Keep CLI options descriptive and align new configs with the existing TOML examples. | ||
|
|
||
| ## Testing Guidelines | ||
| - Add unit tests alongside new features in `tests/` with filenames `test_*.py` and functions `test_*`. | ||
| - Cover edge cases around data loading, config parsing, and checkpoint handling; favor small fixtures over large data blobs. | ||
| - For regressions, reproduce with a failing test first, then implement the fix. | ||
|
|
||
| ## Commit & Pull Request Guidelines | ||
| - Follow the short, imperative style seen in history (`chore: …`, `patch: …`, or focused message without trailing punctuation). Reference issue/PR numbers where applicable. | ||
| - PRs should explain the change, risks, and testing done (`uv run pytest`, `uv run ruff check .`). Include CLI examples if you changed commands or configs. | ||
| - Keep diffs scoped; split unrelated changes into separate PRs. Include screenshots or logs only when UI/output changes are relevant. | ||
|
|
||
| ## Security & Configuration Tips | ||
| - Do not commit dataset paths or secrets; use environment variables or local config files kept out of git. | ||
| - Validate file paths in new CLI options and prefer existing config loaders under `state/_cli` to avoid duplicating logic. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,64 @@ | ||
| # Migration: HVG Gene Names Stored in AnnData Uns | ||
|
|
||
| ## Summary | ||
|
|
||
| Recent versions of STATE store highly variable gene (HVG) names in `adata.uns["X_hvg_var_names"]`. | ||
| This makes it possible for downstream tools to map `adata.obsm["X_hvg"]` columns back to gene IDs. | ||
|
|
||
| If you have preprocessed data created before this change, you can backfill the HVG names with the | ||
| script below. | ||
|
|
||
| ## Backward Compatibility | ||
|
|
||
| This change is fully backward compatible: | ||
|
|
||
| - **Existing preprocessed data**: Inference commands continue to work without modification. A | ||
| non-blocking warning is emitted recommending re-preprocessing, but execution proceeds normally. | ||
| - **Existing trained models**: Model checkpoints do not depend on this uns key. Gene names are | ||
| already captured in `var_dims.pkl` at training time. | ||
| - **Downstream code**: Code unaware of `X_hvg_var_names` simply ignores it. The obsm matrix | ||
| structure is unchanged. | ||
|
|
||
| ### Fallback Behavior | ||
|
|
||
| When `X_hvg_var_names` is absent, STATE attempts to recover gene names from | ||
| `adata.var_names[adata.var.highly_variable]`. This fallback succeeds as long as the | ||
| `highly_variable` boolean column remains in `adata.var`. | ||
|
|
||
| ### When Gene Names Are Unrecoverable | ||
|
|
||
| Gene names cannot be recovered if an h5ad file has `X_hvg` in obsm but: | ||
|
|
||
| 1. No `X_hvg_var_names` in uns, AND | ||
| 2. No `highly_variable` column in var (e.g., var was subset or modified) | ||
|
|
||
| This edge case would already be broken prior to this change. The new feature makes the mapping | ||
| explicit rather than implicit. | ||
|
|
||
| ## Backfill Script | ||
|
|
||
| For existing preprocessed files, run the following to add `X_hvg_var_names`: | ||
| ```python | ||
| import anndata as ad | ||
| import numpy as np | ||
|
|
||
| adata = ad.read_h5ad("your_preprocessed_data.h5ad") | ||
|
|
||
| if "X_hvg" in adata.obsm and "X_hvg_var_names" not in adata.uns: | ||
|     if "highly_variable" in adata.var.columns: | ||
|         hvg_names = adata.var_names[adata.var.highly_variable].tolist() | ||
|         adata.uns["X_hvg_var_names"] = np.array(hvg_names, dtype=object) | ||
|         adata.write_h5ad("your_preprocessed_data.h5ad") | ||
|         print(f"Added {len(hvg_names)} HVG names to uns") | ||
|     else: | ||
|         print("Cannot backfill: 'highly_variable' column not found in adata.var") | ||
| else: | ||
|     print("Backfill not needed or X_hvg not present") | ||
| ``` | ||
|
|
||
| ## Notes | ||
|
|
||
| - The uns key is stored as a NumPy array of Python strings for h5ad compatibility. | ||
| - Re-running `state tx preprocess_train` with the latest version will populate this automatically. | ||
| - The naming convention `{obsm_key}_var_names` allows for multiple obsm matrices with associated | ||
| gene names (e.g., `X_pca_var_names` if needed in the future). |
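
The lookup order described in the migration note (the uns key first, then the `highly_variable` mask, otherwise nothing) can be sketched as below. This is an illustration only: `resolve_hvg_names` is a hypothetical helper, not the `get_hvg_var_names` function added in this PR (its body is not shown in this diff), and it takes plain Python containers instead of an AnnData object so that it runs without extra dependencies.

```python
from typing import Any, List, Mapping, Optional, Sequence


def resolve_hvg_names(
    uns: Mapping[str, Any],
    var_names: Sequence[str],
    highly_variable: Optional[Sequence[bool]] = None,
    obsm_key: str = "X_hvg",
) -> Optional[List[str]]:
    """Hypothetical sketch of the documented lookup order for HVG names."""
    # 1. Prefer the explicit mapping stored under "{obsm_key}_var_names".
    key = f"{obsm_key}_var_names"
    if key in uns:
        return [str(name) for name in uns[key]]

    # 2. Fall back to the boolean highly_variable mask, if it still exists.
    if highly_variable is not None:
        if len(highly_variable) != len(var_names):
            raise ValueError("highly_variable mask and var_names differ in length")
        return [str(name) for name, keep in zip(var_names, highly_variable) if keep]

    # 3. Neither source is available: the names are unrecoverable.
    return None


# No uns key present, so the mask decides which names are returned.
print(resolve_hvg_names({}, ["GeneA", "GeneB", "GeneC"], [True, False, True]))
# ['GeneA', 'GeneC']
```

With a real AnnData object, the inputs would presumably be `adata.uns`, `adata.var_names`, and the `highly_variable` column of `adata.var` when that column exists.
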
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -76,6 +76,11 @@ def add_arguments_infer(parser: argparse.ArgumentParser): | |
| action="store_true", | ||
| help="Reduce logging verbosity.", | ||
| ) | ||
| parser.add_argument( | ||
| "--verbose", | ||
| action="store_true", | ||
| help="Show extra details about gene name mapping.", | ||
| ) | ||
| parser.add_argument( | ||
| "--tsv", | ||
| type=str, | ||
|
|
@@ -119,6 +124,8 @@ def run_tx_infer(args: argparse.Namespace): | |
| from tqdm import tqdm | ||
|
|
||
| from ...tx.models.state_transition import StateTransitionPerturbationModel | ||
| from ...tx.constants import HVG_VAR_NAMES_KEY | ||
| from ...tx.utils.hvg import get_hvg_var_names | ||
|
|
||
| # ----------------------- | ||
| # Helpers | ||
|
|
@@ -422,6 +429,26 @@ def pad_adata_with_tsv( | |
| # ----------------------- | ||
| adata = sc.read_h5ad(args.adata) | ||
|
|
||
| hvg_names = None | ||
| hvg_names_status = "n/a" | ||
| if args.embed_key == "X_hvg": | ||
|     hvg_names = get_hvg_var_names(adata, obsm_key="X_hvg") | ||
|     if hvg_names is None and not args.quiet: | ||
|         print( | ||
|             "Warning: adata.uns['X_hvg_var_names'] not found. " | ||
|             "Downstream analysis (e.g., pdex) may not be able to map predictions to gene names. " | ||
|             "Consider re-running preprocess_train with the latest STATE version." | ||
|         ) | ||
|     if hvg_names is not None: | ||
|         hvg_names_status = "present" | ||
|     else: | ||
|         hvg_names_status = "missing" | ||
|     if args.verbose and not args.quiet: | ||
|         if hvg_names is not None: | ||
|             print(f"HVG gene names found for X_hvg: {len(hvg_names)} entries.") | ||
|         else: | ||
|             print("HVG gene names not found for X_hvg.") | ||
|
|
||
| # optional TSV padding mode - pad with additional perturbation cells | ||
| if args.tsv: | ||
| if not args.quiet: | ||
|
|
@@ -904,6 +931,10 @@ def group_control_indices(group_name: str) -> np.ndarray: | |
| elif output_space == "all": | ||
| adata.X = sim_counts | ||
|
|
||
| # Store HVG names if available | ||
| if hvg_names is not None: | ||
|     adata.uns[HVG_VAR_NAMES_KEY] = np.array(hvg_names, dtype=object) | ||
|
Review comment (Medium Severity): Missing dimension validation when storing HVG names in infer. When there's a dimension mismatch between the input …
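
One possible way to address the comment above is to store the names only when their count matches the width of the matrix they describe. The sketch below is an assumption, not code from this PR: `names_matching_width` is a hypothetical helper, and because the review comment is truncated here, the choice of which width to compare against is left to the caller.

```python
import warnings
from typing import List, Optional, Sequence


def names_matching_width(
    hvg_names: Optional[Sequence[str]], n_columns: int
) -> Optional[List[str]]:
    """Return the names only if there is exactly one name per column."""
    if hvg_names is None:
        return None
    names = [str(name) for name in hvg_names]
    if len(names) != n_columns:
        # Dropping mismatched names avoids writing a misleading mapping to uns.
        warnings.warn(
            f"Got {len(names)} HVG names for {n_columns} columns; not storing them.",
            stacklevel=2,
        )
        return None
    return names
```

A caller could then write the uns entry only when the returned value is not `None`.
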
||
|
|
||
| if output_is_npy: | ||
| if pred_matrix is None: | ||
| raise ValueError("Predictions matrix is unavailable; cannot write .npy output") | ||
|
|
@@ -927,3 +958,4 @@ def group_control_indices(group_name: str) -> np.ndarray: | |
| print(f"Saved: {output_path}") | ||
| if counts_written and counts_out_target: | ||
| print(f"Saved count predictions to adata.{counts_out_target}") | ||
| print(f"HVG names: {hvg_names_status}") | ||
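
The three values that the final `HVG names:` line can report follow from the branching added earlier in `run_tx_infer`. As a minimal sketch, assuming the status is only updated when `--embed_key` is `X_hvg` (the indentation of the diff is not preserved on this page), the logic reduces to the hypothetical function below, which is not part of the PR.

```python
from typing import Optional, Sequence


def hvg_names_status(embed_key: str, hvg_names: Optional[Sequence[str]]) -> str:
    """Hypothetical summary of the status string printed at the end of inference."""
    if embed_key != "X_hvg":
        # Gene-name lookup is skipped for other embedding keys.
        return "n/a"
    return "present" if hvg_names is not None else "missing"


for key, names in [("X_hvg", ["GeneA"]), ("X_hvg", None), ("X_pca", None)]:
    print(f"HVG names: {hvg_names_status(key, names)}")
```
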