69 commits
07cccad
try
kennyworkman Oct 24, 2025
bd1fc2d
atlasx evals
hannahle Oct 25, 2025
01228b2
Merge remote-tracking branch 'origin/main' into kenny/eval-harness-pr…
kennyworkman Oct 28, 2025
c3ba178
globals, signals
kennyworkman Oct 28, 2025
1b1df95
atlasx docs and more evals
hannahle Oct 28, 2025
da79cb1
fixes
kennyworkman Oct 29, 2025
d74fca7
cell typing grader
kennyworkman Oct 30, 2025
46f5abb
Merge remote-tracking branch 'origin/main' into kenny/eval-harness-pr…
kennyworkman Nov 5, 2025
c45c5a2
changes
kennyworkman Nov 5, 2025
263e67a
more
kennyworkman Nov 5, 2025
c0ae128
changes
kennyworkman Nov 5, 2025
7b5d829
Merge branch 'main' into kenny/eval-harness-prod-kernel
kennyworkman Nov 5, 2025
007fe04
graders
kennyworkman Nov 6, 2025
9055939
more graders
kennyworkman Nov 6, 2025
306aba6
changes
kennyworkman Nov 6, 2025
5c7b3bb
proportion consistency
kennyworkman Nov 6, 2025
9c4b4b0
atlas evals
hannahle Nov 6, 2025
09bd0ab
qc evals
hannahle Nov 7, 2025
95597dc
more atlas evals
hannahle Nov 7, 2025
3a1fdcd
fun
kennyworkman Nov 7, 2025
77d89f8
spatial adjacency
kennyworkman Nov 7, 2025
e13912b
thing
kennyworkman Nov 7, 2025
7fe803a
done
kennyworkman Nov 7, 2025
4a8cea9
more evals
hannahle Nov 7, 2025
1db3e74
added ami grader and clustering tests
hannahle Nov 7, 2025
dcc5d16
update batch json
hannahle Nov 7, 2025
edefe2b
update batch json
hannahle Nov 7, 2025
6d2f129
added DESCRIPTION.MD for atlas
hannahle Nov 7, 2025
bb3937a
done
hannahle Nov 7, 2025
3a1d097
vizgen evals
hsmurali Nov 7, 2025
33865ba
vizgen evals doc
hsmurali Nov 7, 2025
cb9fb88
Merge pull request #151 from latchbio/harihara/vizgen_evals
hsmurali Nov 7, 2025
ae88c87
eval version 0
Nov 8, 2025
73f7be4
edit test results for hsc typing
hannahle Nov 10, 2025
4b00604
pseudobulk eval
hannahle Nov 10, 2025
95fd278
motif eval
hannahle Nov 10, 2025
c99f3d6
clean up
kennyworkman Nov 11, 2025
306a803
clean
kennyworkman Nov 11, 2025
4009059
modified DE prompts
hsmurali Nov 11, 2025
2976c61
Merge pull request #155 from latchbio/harihara/vizgen_evals
hsmurali Nov 11, 2025
715ba87
add eval for differential expression
Nov 11, 2025
d61d7bc
relax qc thresholds
hannahle Nov 11, 2025
5b7efb1
small fix
hannahle Nov 11, 2025
a3fba58
updated mito
hannahle Nov 11, 2025
cba8bbe
coarse clustering
hannahle Nov 11, 2025
03cdda5
coarse threshold prompt edits
hannahle Nov 11, 2025
7bc799a
fix norm eval
Nov 11, 2025
ed11838
updates
hannahle Nov 11, 2025
05f5217
clustering evals
hannahle Nov 11, 2025
ddbfaad
atlasx pudates
hannahle Nov 11, 2025
c5b2d3b
threshold for spatial contiguity
hannahle Nov 12, 2025
983d1ff
adjusted thresholds for cell type consistency per condition
hannahle Nov 12, 2025
0bf0c70
adjusted thresholds for cell type consistency per condition
hannahle Nov 12, 2025
8461e58
merge stuff
kennyworkman Nov 12, 2025
4280ffd
pull out internal evals
kennyworkman Nov 12, 2025
7ad090b
batch eval script
kennyworkman Nov 12, 2025
bd8c222
updated xenium eval
Nov 12, 2025
a9bf6ca
updated xenium eval
Nov 12, 2025
5eb8277
updated xenium evals
Nov 13, 2025
5ed51d4
Final eval set
hsmurali Nov 13, 2025
ee67d27
Merge pull request #160 from latchbio/harihara/vizgen_evals
hsmurali Nov 13, 2025
15e98b8
edit thresholds for spatial enrichment
hannahle Nov 12, 2025
69ab6bd
updated evals
hannahle Nov 13, 2025
21b4783
added tcell and macrophage evals
LoocasGoose Nov 21, 2025
cc9abef
Add eval 1-maggie
Nov 21, 2025
e4aa583
deleted
Nov 21, 2025
d9bd8b0
breast cancer luminal onevsall marker jaccard v1 ttest
Nov 21, 2025
63e7fb8
remaining 2 evals
Nov 21, 2025
4a67e04
added tcell and macrophage evals
LoocasGoose Nov 21, 2025
5 changes: 4 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -7,4 +7,7 @@ __pycache__
/.envrc
/.venv

sandbox
runtime/mount/agent_config/evals/result_*.json
runtime/mount/agent_config/evals/results/*
runtime/mount/agent_config/context/notebook_context/cells.md
runtime/mount/agent_config/evals/results
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -23,6 +23,9 @@ dependencies = [
"dill>=0.3.9",
"anndata>=0.10.10",
"anthropic>=0.71.0",
"scikit-learn>=1.7.2",
"scipy>=1.15.2",
"statsmodels>=0.14.5",
]
requires-python = "==3.11.*"
readme = "README.md"
203 changes: 139 additions & 64 deletions runtime/mount/agent_config/context/technology_docs/atlasxomics.md
@@ -1,12 +1,12 @@
<!-- markdownlint-disable -->
## Analysis Guideline

This is the **authoritative step-by-step pipeline** for an AtlasXOmics experiment. Follow the steps in order.

1. **Experiment Setup** - If not clear from original request, ask users to confirm if they want to perform analysis on **gene activity score AnnData** (recommended) or **motif enrichment scores AnnData**.
2. **Data Loading** - load data using **Scanpy** and display it with `w_h5`.
3. **Clustering (workflow only)** - Launch the AtlasXOmics clustering workflow using `w_workflow(wf_name="wf.__init__.opt_workflow", ...)`. Fall back to `scanpy` only if this fails.
4. **Differential Gene Activity or Motif Enrichment Comparison** - Use `w_workflow(wf_name="wf.__init__.compare_workflow", ...)`
5. **Cell Type Annotation** - assign biological meaning to clusters using gene sets.
1. **Data Loading**
2. **Quality Control**
3. **Batch Correction (for multi-sample datasets)**
4. **Clustering (workflow only)** - Launch the AtlasXOmics clustering workflow using `w_workflow(wf_name="wf.__init__.opt_workflow", ...)`. Fall back to `scanpy` only if this fails.
5. **Differential Gene Activity or Motif Enrichment Comparison** - Use `w_workflow(wf_name="wf.__init__.compare_workflow", ...)`
6. **Cell Type Annotation** — Use CellGuide marker database (see file `technology_docs/marker_cell_typing.md`)

The section below defines detailed guidelines for each of the above steps.

@@ -21,12 +21,102 @@ The section below defines detailed guidelines for each of the above steps.
5. Must end with `execution = w.value` so a button is displayed to run the workflow.
</workflow_rules>

### **Data Loading**:
- Locate the appropriate Latch path for either:
- `combined_sm_ge.h5ad` — gene activity scores
- `combined_sm_motifs.h5ad` — motif enrichment results
- Use `LPath` to load the file. **Always** prefer **full, human-readable latch:// paths** (not node IDs).
- Once loaded, visualize the AnnData object using the w_h5 widget for inspection.
### **Quality Control**:

- Use the `snapatac2` library for computing and visualizing ATAC-seq quality metrics.
- **MANDATORY**: Before running any QC or filtering steps, **verify whether the AnnData object has already been pre-processed or quality-controlled.**

```python
import snapatac2 as snap
```
- **Key QC metrics**: Check which metrics already exist in `adata.obs`, run adaptive filtering with those first, and only compute any missing metrics afterward if needed.
- Fragment Size Distribution
- TSS Enrichment (TSSE)
- FRiP — Fraction of Reads in Peaks
- Nucleosome Signal
- Number of Fragments per Cell
- Mitochondrial Read Fraction

- **Adaptive, per-sample QC filtering**:
  - Inputs (`adata.obs`): `n_fragments`, `tsse`, `frip`, `nucleosome_signal`, `mitochondrial_fraction`
  - Batch key: `sample`
  - Heuristic (per-batch quantiles):
    - `n_fragments`: keep [max(q5, 1k), min(q99.5, 50k)]
    - `tsse`: ≥ min(q10, 2)
    - `frip`: ≥ min(q10, 0.2)
    - `nucleosome_signal`: ≤ max(q90, 4)
    - `mitochondrial_fraction`: ≤ max(q90, 0.10)
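
The per-batch heuristic above can be sketched in plain Python as follows. This is a minimal illustration, not the harness's implementation: the helper names (`quantile`, `qc_thresholds`, `passes_qc`) and the threshold key names are hypothetical, and the quantile uses linear interpolation as an assumption. In a notebook, the per-sample columns would typically come from grouping `adata.obs` by `sample`.

```python
from typing import Dict, Mapping, Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Quantile with linear interpolation between order statistics (q in [0, 1])."""
    if not values:
        raise ValueError("cannot take a quantile of an empty batch")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def qc_thresholds(batch: Mapping[str, Sequence[float]]) -> Dict[str, float]:
    """Derive thresholds for ONE sample from its per-cell metric columns."""
    return {
        # Fragment counts: quantile window, clamped to the 1k / 50k limits.
        "n_fragments_min": max(quantile(batch["n_fragments"], 0.05), 1_000),
        "n_fragments_max": min(quantile(batch["n_fragments"], 0.995), 50_000),
        # Lower bounds: never stricter than the fixed cut-off.
        "tsse_min": min(quantile(batch["tsse"], 0.10), 2.0),
        "frip_min": min(quantile(batch["frip"], 0.10), 0.2),
        # Upper bounds: never stricter than the fixed cut-off.
        "nucleosome_signal_max": max(quantile(batch["nucleosome_signal"], 0.90), 4.0),
        "mitochondrial_fraction_max": max(
            quantile(batch["mitochondrial_fraction"], 0.90), 0.10
        ),
    }


def passes_qc(cell: Mapping[str, float], t: Mapping[str, float]) -> bool:
    """Return True if a single cell satisfies every threshold of its sample."""
    return (
        t["n_fragments_min"] <= cell["n_fragments"] <= t["n_fragments_max"]
        and cell["tsse"] >= t["tsse_min"]
        and cell["frip"] >= t["frip_min"]
        and cell["nucleosome_signal"] <= t["nucleosome_signal_max"]
        and cell["mitochondrial_fraction"] <= t["mitochondrial_fraction_max"]
    )
```

Computing the thresholds once per sample and then testing each cell against its own sample's thresholds keeps a deeply sequenced run from setting the cut-offs for a shallow one.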

#### 1. Fragment Size Distribution

**Purpose:** Assess nucleosome periodicity and library quality.
**Expected pattern:**
- **80–300 bp:** Nucleosome-free (open chromatin)
- **~150–200 bp:** Mono-nucleosome peak
- **~300–400 bp:** Di-nucleosome peak
- **>500 bp:** Multi-nucleosome or artifacts

**Example:**
```python
fig = snap.pl.frag_size_distr(data, show=False)
fig.update_yaxes(type="log")
```

#### 2. TSS Enrichment (TSSE)

**Purpose:** Quantify enrichment of accessible fragments near transcription start sites.
- **High TSSE (≥ 5–10):** Strong promoter accessibility, good quality
- **Low TSSE (< 4):** Poor signal, low complexity, or over-digestion

**Example:**
```python
# gene_anno: a Genome object or a GTF/GFF path matching the sample's genome
snap.metrics.tsse(data, gene_anno)
```

#### 3. FRiP — Fraction of Reads in Peaks

**Purpose:** Quantify the share of fragments falling inside called peaks; higher FRiP means cleaner regulatory signal (≈0.2 good, <0.1 noisy).

**Example:**
```python
snap.metrics.frip(adata, regions, inplace=True, n_jobs=8)
```

**Inputs:**
- `adata`: AnnData or list of AnnData objects to annotate; writes scores to `adata.obs` when `inplace=True`.
- `regions`: dict mapping peak-set names to BED paths or genomic interval lists.
- `n_jobs`: parallel workers (use `-1` for all cores).

**Note:** Run `snap.pp.import_data(...)` beforehand to load fragment data.

#### 4. Nucleosome Signal

**Purpose:** Ratio of mono/di-nucleosomal to short fragments.
- **Low (< 2):** Good chromatin accessibility
- **High (> 4):** Over-digested or low-quality libraries
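
To make the ratio concrete, here is a minimal sketch that computes a nucleosome signal from one cell's fragment lengths. The cut-offs (under 147 bp counted as short, 147 to 294 bp counted as nucleosomal) are an assumption borrowed from a common convention, not something this document specifies, and `nucleosome_signal` is a hypothetical helper rather than a SnapATAC2 function.

```python
from typing import Iterable


def nucleosome_signal(
    fragment_lengths: Iterable[int],
    short_max: int = 147,        # assumed upper bound (exclusive) for short fragments
    nucleosomal_max: int = 294,  # assumed upper bound (inclusive) for nucleosomal fragments
) -> float:
    """Ratio of nucleosomal to short fragments for a single cell."""
    short = 0
    nucleosomal = 0
    for length in fragment_lengths:
        if length < short_max:
            short += 1
        elif length <= nucleosomal_max:
            nucleosomal += 1
        # Longer fragments are ignored in this sketch.
    if short == 0:
        return float("inf")  # no short fragments: the ratio is undefined
    return nucleosomal / short
```

With these assumed cut-offs, a cell with four short fragments and two nucleosomal ones scores 0.5, which falls in the "low" range described above.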

#### 5. Number of Fragments per Cell (`adata.obs["n_fragment"]`)

**Purpose:** Assess sequencing depth and data sparsity per cell/barcode.
- **Low fragments (< 1 k):** Dropouts or ambient noise
- **Extremely high:** Doublets or multiplets

#### 6. Mitochondrial Read Fraction (`adata.obs["frac_mito"]`)

**Purpose:** Detect low-quality or dying cells with excessive mitochondrial reads.
- **High (> 10 %):** Possible cell stress or broken nuclei

---

### **Batch Correction (SnapATAC2)**

If `adata.obs['sample']` contains more than one sample, run a batch-correction pass before clustering:

1. After QC, call `snap.pp.mnc_correct(adatas, batch='sample')` to apply the modified mutual-nearest-neighbour correction.
2. Optionally refine with `snap.pp.harmony(adatas, batch='sample', max_iter_harmony=20)` to harmonise global structure.
3. Recompute embeddings (e.g., `snap.tl.spectral`, `snap.tl.umap`) using the corrected representation, and proceed with clustering.

---

### **Clustering (workflow only)**:
- Use `w_workflow` to launch the AtlasXOmics clustering workflow.
@@ -35,6 +125,7 @@ The section below defines detailed guidelines for each of the above steps.
### Clustering Workflow Parameters

- Strictly follow <workflow_rules>.
- Because clustering outcomes are sensitive to input settings, always set default `n_features`, `resolution`, and `n_comps` to multiple values so users can compare results and select the most meaningful clustering after the workflow finishes running.
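
As a rough guide to how many result sets a multi-value default produces, the sketch below enumerates every combination of the list-valued parameters. It assumes the workflow runs the full Cartesian product of the lists, which this document does not state explicitly; `parameter_grid` is a hypothetical helper, and the values are the examples given for the optional parameters.

```python
from itertools import product
from typing import Dict, List, Sequence, Union

Number = Union[int, float]


def parameter_grid(
    n_features: Sequence[int],
    resolution: Sequence[float],
    varfeat_iters: Sequence[int],
    n_comps: Sequence[int],
) -> List[Dict[str, Number]]:
    """Enumerate every combination of the list-valued clustering parameters."""
    keys = ("n_features", "resolution", "varfeat_iters", "n_comps")
    return [
        dict(zip(keys, combo))
        for combo in product(n_features, resolution, varfeat_iters, n_comps)
    ]


grid = parameter_grid(
    n_features=[25000, 50000, 100000],
    resolution=[0.5, 1.0, 1.25, 1.5],
    varfeat_iters=[1],
    n_comps=[30, 50],
)
print(len(grid))  # 3 * 4 * 1 * 2 = 24 combinations
```

Under that assumption the example defaults already yield 24 clusterings, so it is worth keeping each list short.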

#### **Required**
- `project_name` *(str)*
@@ -58,16 +149,16 @@ The section below defines detailed guidelines for each of the above steps.
Genomic bin size (default: `5000`)

- `n_features` *(List[int])*
Top accessible tiles to use, e.g., `[25000]`
Top accessible tiles to use, e.g., `[25000, 50000, 100000]`

- `resolution` *(List[float])*
Clustering resolution, e.g., `[1.0]`
Clustering resolution, e.g., `[0.5, 1.0, 1.25, 1.5]`

- `varfeat_iters` *(List[int])*
Iterations for variable feature selection, e.g., `[1]`

- `n_comps` *(List[int])*
Dimensionality reduction components, e.g., `[30]`
Dimensionality reduction components, e.g., `[30, 50]`

- `min_cluster_size` *(int)*
Minimum cells per cluster
@@ -183,21 +274,6 @@ w = w_workflow(
execution = w.value
```

#### Example Implementation
```python
viewer = w_h5(ann_data=adata)
value = viewer.value

# Get the current selections
if value['lasso_points']:
    print(f"User selected {len(value['lasso_points'])} regions")
    print(f"The embedding used for the lasso selection is {value['lasso_points_obsm']}")
    for i, region in enumerate(value['lasso_points']):
        print(f"Region {i}: {len(region)} points")

# Proceed to create an `adata_subset` based on lasso-selected points
```

### **Differential Gene Activity or Motif Enrichment Comparison (workflow only)**
- Use `w_workflow` to launch the AtlasXOmics comparison workflow.
- Automatically infer the correct grouping column from `adata.obs` (`condition`, `sample`, or `cluster`).
@@ -248,13 +324,7 @@ execution = w.value
#### How to construct `compare_config.json`
To run a comparison workflow, you must generate a `compare_config.json` file that defines which cells belong to each group.

What to Ask the User:
- Comparison Target: Ask the user what biological groups or conditions they want to compare. For example:
- "Diseased vs Healthy"
- "Sample A vs Sample B"
- "Cluster 5 vs Cluster 7"

How to Identify Groups:
- Comparison Target: Ask the user what biological groups or conditions they want to compare.
- You must infer which column in adata.obs encodes this grouping (e.g. "condition", "sample", or "cluster").
- Then, for each group (A and B), filter all cell barcodes in `adata.obs.index` that match the selected value.
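
The barcode selection described above can be sketched as follows. This is an illustration only: `build_groupings` is a hypothetical helper, and the `"groupA"` / `"groupB"` keys are placeholders, not the schema the comparison workflow expects. The function takes plain sequences so it can be fed `list(adata.obs.index)` and `list(adata.obs["condition"])`.

```python
import json
from typing import Dict, List, Sequence


def build_groupings(
    barcodes: Sequence[str],
    labels: Sequence[str],
    value_a: str,
    value_b: str,
) -> Dict[str, List[str]]:
    """Collect the barcodes whose label matches each of the two compared values."""
    if len(barcodes) != len(labels):
        raise ValueError("barcodes and labels must have the same length")
    group_a = [b for b, label in zip(barcodes, labels) if label == value_a]
    group_b = [b for b, label in zip(barcodes, labels) if label == value_b]
    if not group_a or not group_b:
        # An empty group usually means the wrong obs column was inferred.
        raise ValueError("each comparison group needs at least one barcode")
    # Placeholder keys; use the layout required by the workflow instead.
    return {"groupA": group_a, "groupB": group_b}


example = build_groupings(
    barcodes=["AAAC-1", "AAAG-1", "AACT-1"],
    labels=["Diseased", "Healthy", "Diseased"],
    value_a="Diseased",
    value_b="Healthy",
)
print(json.dumps(example, indent=2))
```

Failing early on an empty group is a deliberate choice: it surfaces a mis-inferred grouping column before the workflow is launched.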

@@ -287,23 +357,38 @@ latch_path = LPath.upload(Path(local_path), remote_path)
groupings_file = LatchFile(remote_path)
```

### **Cell Type Annotation**
## Data Assumptions

- If the dataset context is unclear, first ask the user to confirm the **organism** and **tissue type**. **Do NOT proceed** until the user has answered the question.
- **Always render a form with sensible defaults** to avoid tedious manual input. The form should support **multiple candidate cell types**, e.g. one row per cell type:
- `cell_type`: **text input**, pre-filled with a common or inferred cell type.
- `marker_genes`: **multiselect widget**, pre-populated with default marker genes for that cell type.
- You **must auto-populate all fields with reasonable defaults using domain knowledge**. Users should only adjust values if needed, not enter them from scratch.
- Add a **button** after the form to trigger gene set scoring.
- For **spatial ATAC-seq** only, infer cell identity by computing **gene activity or gene set scores** (e.g., `scanpy.tl.score_genes`) and ranking cell types based on marker enrichment.
AtlasXOmics datasets have the following data conventions. Assume this structure exists without asking the user to confirm.

## Data Assumptions
### Raw Data
Raw data consists of fragment files (fragments.tsv.gz) and 'spatial/' folders, which contain images, image metadata, and barcode-image mappings stored as CSV files. Every experiment (designated by a unique 'Run ID' of the form Dxxxxx, where x is a digit) is associated with distinct raw data. Fragment files and spatial folders are stored on different file paths in Latch Data.

AtlasXOmics datasets follow a standard output folder schema across all projects.
Assume this structure exists without asking the user to confirm.
In the default AtlasXomics Workspace (13502), fragment files are stored in the path `/chromap_outs/[Run_ID]/chromap_output/fragments.tsv.gz`. Spatial folders are stored in the path `/Images_spatial/[Run_ID]/spatial`.

### Example Folder Layout
root/
In collaborator Workspaces (not 13502), fragment files and spatial folders are stored together in the parent directory corresponding to the Run ID. Fragment files are stored at `.../Raw_Data/[Run_ID]/chromap_output/fragments.tsv.gz`, the BED file at `.../Raw_Data/[Run_ID]/chromap_output/aln.bed`, and spatial folders at `.../Raw_Data/[Run_ID]/spatial`.
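
The two layouts can be captured in a small path helper. This is a sketch under stated assumptions: `raw_data_paths` and its dictionary keys are hypothetical, and `raw_data_root` stands for the workspace-specific `.../Raw_Data` directory, whose prefix is not given here and must be supplied by the caller.

```python
import re
from typing import Dict, Optional

# Run IDs have the form Dxxxxx, where x is a digit.
RUN_ID_PATTERN = re.compile(r"D\d{5}")


def raw_data_paths(run_id: str, raw_data_root: Optional[str] = None) -> Dict[str, str]:
    """Build raw-data paths for a run.

    With raw_data_root=None the default workspace (13502) layout is used;
    otherwise raw_data_root is the collaborator workspace's Raw_Data directory.
    """
    if not RUN_ID_PATTERN.fullmatch(run_id):
        raise ValueError(f"not a valid Run ID: {run_id!r}")
    if raw_data_root is None:
        return {
            "fragments": f"/chromap_outs/{run_id}/chromap_output/fragments.tsv.gz",
            "spatial": f"/Images_spatial/{run_id}/spatial",
        }
    base = f"{raw_data_root.rstrip('/')}/{run_id}"
    return {
        "fragments": f"{base}/chromap_output/fragments.tsv.gz",
        "bed": f"{base}/chromap_output/aln.bed",
        "spatial": f"{base}/spatial",
    }
```

Validating the Run ID up front avoids silently building a path for a project name or a malformed identifier.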

### Workflow Outputs

#### Clustering Workflow (optimize_snap)
The clustering workflow (optimize_snap, wf.__init__.opt_workflow) stores outputs on the path `/snap_opts/[project name]/` where "project name" is designated by the user in the Workflow input parameters. The output directory has the following structure:
[project name]/
├── figures/
├── medians.csv
├── set1_ts5000-vf500000-cr1-0-vi1-nc30
├── set2_ts5000-vf500000-cr1-0-vi1-nc40

Each folder with the prefix 'setN' corresponds to a combination of input clustering parameters. Each contains a 'combined.h5ad' file which stores the AnnData object generated with the specified clustering parameters, with .X as a tile matrix.
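
When comparing results it helps to recover the parameters from a `setN` folder name. The sketch below does this with a regular expression. The meaning of each abbreviation (`ts` tile size, `vf` number of variable features, `cr` clustering resolution with the decimal point written as a hyphen, `vi` variable-feature iterations, `nc` number of components) is inferred from the example names and the workflow parameters, not confirmed by this document, and `parse_set_dir` is a hypothetical helper.

```python
import re
from typing import Dict, Union

# Pattern inferred from names such as "set1_ts5000-vf500000-cr1-0-vi1-nc30".
SET_DIR_PATTERN = re.compile(
    r"set(?P<index>\d+)_ts(?P<tile_size>\d+)-vf(?P<n_features>\d+)"
    r"-cr(?P<res_int>\d+)-(?P<res_frac>\d+)"
    r"-vi(?P<varfeat_iters>\d+)-nc(?P<n_comps>\d+)"
)


def parse_set_dir(name: str) -> Dict[str, Union[int, float]]:
    """Split a setN folder name into its (assumed) clustering parameters."""
    match = SET_DIR_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised set folder name: {name!r}")
    return {
        "index": int(match["index"]),
        "tile_size": int(match["tile_size"]),
        "n_features": int(match["n_features"]),
        # "cr1-0" is read as resolution 1.0 (assumption).
        "resolution": float(f"{match['res_int']}.{match['res_frac']}"),
        "varfeat_iters": int(match["varfeat_iters"]),
        "n_comps": int(match["n_comps"]),
    }
```

Applied to the two example folders, this yields the same settings except for `n_comps` (30 versus 40), which makes it easy to tabulate runs side by side.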

The figures/ directory contains plots saved as .pdf files for QC and to guide selection of cluster parameters for downstream analysis. The medians.csv contains QC metrics for the project.

#### atx_snap and create ArchRProject Workflows

These two Workflows create files to be analyzed in Plots. A Workflow takes as input one or multiple Run IDs and corresponding raw data. It creates the outputs detailed below. The `combined_sm_ge.h5ad` and `combined_sm_motifs.h5ad` files are AnnData objects with .X as gene accessibility data and motif enrichment data, respectively. Each object contains data for all Run IDs specified in the inputs. Runs are designated by the column 'sample' in AnnData .obs.

In the default AtlasXomics Workspace (13502), results are stored in `/snap_outs/[project name]` where project name is specified in the Execution inputs. In customer Workspaces, data is stored in `.../Processed_Data/[project name]`.

project_name/
├── cluster_coverages/
├── condition_coverages/
├── figures/
@@ -328,22 +413,12 @@ root/
├── D02310_NG07345_m_converted.h5ad
├── D02310_NG07345_SeuratObjMotif.rds
├── D02310_NG07345_SeuratObj.rds
├── enrichMotifs_clusters.rds
├── enrichMotifs_condition_1.rds
├── enrichMotifs_sample.rds
├── markersGS_clusters.rds
├── markersGS_condition_1.rds
├── markersGS_sample.rds


**Important files to pay attention to**
- `combined_sm_ge.h5ad`: Stores gene activity score for every spot
- `combined_sm_motifs.h5ad`: Stores motif enrichment score for every spot
- `combined_sm_ge.h5ad`: Stores gene activity score for every spot for all Run IDs
- `combined_sm_motifs.h5ad`: Stores motif enrichment score for every spot for all Run IDs
- Both files follow the standard AnnData structure with `obs`, `uns`, `obsm`, and `obsp` components as detailed below.
- Folder ending with `_ArchRProject`: An `ArchR` project
- Folder ending with `_coverages`: Contains `.bw` files which can be visualized in the IGV browser.
- Folder ending with `_ArchRProject`: An `ArchR` project

**Standard Fields inside an AnnData Object**
Example structure of `combined_sm_ge.h5ad`