latchbio · mdawn65 · Oct 24, 2025 · Oct 25, 2025 · Oct 28, 2025 · Oct 28, 2025
diff --git a/.gitignore b/.gitignore
@@ -7,4 +7,7 @@ __pycache__
 /.envrc
 /.venv
 
-sandbox
+runtime/mount/agent_config/evals/result_*.json
+runtime/mount/agent_config/evals/results/*
+runtime/mount/agent_config/context/notebook_context/cells.md
+runtime/mount/agent_config/evals/results
diff --git a/pyproject.toml b/pyproject.toml
@@ -23,6 +23,9 @@ dependencies = [
     "dill>=0.3.9",
     "anndata>=0.10.10",
     "anthropic>=0.71.0",
+    "scikit-learn>=1.7.2",
+    "scipy>=1.15.2",
+    "statsmodels>=0.14.5",
 ]
 requires-python = "==3.11.*"
 readme = "README.md"

diff --git a/runtime/mount/agent_config/context/technology_docs/atlasxomics.md b/runtime/mount/agent_config/context/technology_docs/atlasxomics.md
@@ -1,12 +1,12 @@
+<!-- markdownlint-disable -->
 ## Analysis Guideline
 
-This is the **authoritative step-by-step pipeline** for AtlasxOmics experiment. Follow steps in order. 
-
-1. **Experiment Setup** - If not clear from original request, ask users to confirm if they want to perform analysis on **gene activity score AnnData** (recommended) or **motif enrichment scores AnnData**. 
-2. **Data Loading** - load data using **Scanpy** and display it with `w_h5`.
-3. **Clustering (workflow only)** - Launch the AtlasXOmics clustering workflow using `w_workflow(wf_name="wf.__init__.opt_workflow", ...)`. Fallback to `scanpy` only if this fails.
-4. **Differential Gene Activity or Motif Enrichment Comparison** - Use `w_workflow(wf_name="wf.__init__.compare_workflow", ...)`
-5.  **Cell Type Annotation** - assign biological meaning to clusters using gene sets. 
+1. **Data Loading**
+2. **Quality Control** 
+3. **Batch Correction (for multi-sample datasets)**
+4. **Clustering (workflow only)** - Launch the AtlasXOmics clustering workflow using `w_workflow(wf_name="wf.__init__.opt_workflow", ...)`. Fallback to `scanpy` only if this fails.
+5. **Differential Gene Activity or Motif Enrichment Comparison** - Use `w_workflow(wf_name="wf.__init__.compare_workflow", ...)`
+6. **Cell Type Annotation** — Use CellGuide marker database (see file `technology_docs/marker_cell_typing.md`)
 
 The section below defines detailed guidelines for each of the above steps.
 
@@ -21,12 +21,102 @@ The section below defines detailed guidelines for each of the above steps.
 5. Must end with `execution = w.value` so a button is displayed to run the workflow.
 </workflow_rules>
 
-### **Data Loading**: 
-- Locate the appropriate Latch path for either:
-  - `combined_sm_ge.h5ad` — gene activity scores
-  - `combined_sm_motifs.h5ad` — motif enrichment results
-- Use `LPath` to load the file. **Always** prefer **full, human-readable latch:// paths** (not node IDs).
-- Once loaded, visualize the AnnData object using the w_h5 widget for inspection.
+### **Quality Control**:
+
+- Use the `snapatac2` library for computing and visualizing ATAC-seq quality metrics.
+- **MANDATORY**: Before running any QC or filtering steps, **verify whether the AnnData object has already been pre-processed or quality-controlled.**
+
+```python
+import snapatac2 as snap
+```
+- **Key QC metrics**: Check which metrics already exist in `adata.obs`, run adaptive filtering with those first, and only compute any missing metrics afterward if needed. 
+  - Fragment Size Distribution
+  - TSS Enrichment (TSSE)
+  - FRiP — Fraction of Reads in Peaks
+  - Nucleosome Signal
+  - Number of Fragments per Cell
+  - Mitochondrial Read Fraction
+
+- **Adaptive, per-sample QC filtering**:
+  Inputs (adata.obs): n_fragments, tsse, frip, nucleosome_signal, mitochondrial_fraction
+  Batch key: sample
+  Heuristic (per-batch quantiles):
+    n_fragments: keep [max(q5, 1k), min(q99.5, 50k)]
+    tsse: ≥ min(q10, 2)
+    frip: ≥ min(q10, 0.2)
+    nucleosome_signal: ≤ max(q90, 4)
+    mitochondrial_fraction: ≤ max(q90, 0.10)
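
The per-batch heuristic above could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual implementation: the helper name `adaptive_qc_mask` and the demo values are hypothetical, the column names are taken from the "Inputs" line (rename them if your object uses, for example, `n_fragment`), and metrics missing from `obs` are skipped.

```python
import pandas as pd

# (quantile, fixed bound) pairs copied from the heuristic above.
LOWER_BOUNDS = {"tsse": (0.10, 2.0), "frip": (0.10, 0.2)}
UPPER_BOUNDS = {"nucleosome_signal": (0.90, 4.0), "mitochondrial_fraction": (0.90, 0.10)}


def adaptive_qc_mask(obs: pd.DataFrame, batch_key: str = "sample") -> pd.Series:
    """Return a boolean Series (True = keep), with thresholds computed per batch."""
    keep = pd.Series(False, index=obs.index)
    for _, df in obs.groupby(batch_key, observed=True):
        ok = pd.Series(True, index=df.index)
        if "n_fragments" in df:
            n = df["n_fragments"]
            # keep [max(q5, 1k), min(q99.5, 50k)]
            ok &= n.between(max(n.quantile(0.05), 1_000), min(n.quantile(0.995), 50_000))
        for col, (q, cap) in LOWER_BOUNDS.items():
            if col in df:  # skip metrics that have not been computed
                ok &= df[col] >= min(df[col].quantile(q), cap)
        for col, (q, floor) in UPPER_BOUNDS.items():
            if col in df:
                ok &= df[col] <= max(df[col].quantile(q), floor)
        keep.loc[df.index] = ok
    return keep


# Hypothetical toy data: one sample, four barcodes, only two metrics present.
demo = pd.DataFrame(
    {"sample": ["A"] * 4, "n_fragments": [500, 5000, 8000, 60000], "tsse": [5.0, 5.0, 1.0, 5.0]},
    index=["bc1", "bc2", "bc3", "bc4"],
)
mask = adaptive_qc_mask(demo)
```

On a real object the mask would typically be applied as `adata = adata[adaptive_qc_mask(adata.obs).to_numpy()].copy()`.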
+
+#### 1. Fragment Size Distribution
+
+**Purpose:** Assess nucleosome periodicity and library quality.  
+**Expected pattern:**
+- **80–300 bp:** Nucleosome-free (open chromatin)
+- **~150–200 bp:** Mono-nucleosome peak
+- **~300–400 bp:** Di-nucleosome peak
+- **>500 bp:** Multi-nucleosome or artifacts
+
+**Example:**
+```python
+fig = snap.pl.frag_size_distr(data, show=False)
+fig.update_yaxes(type="log")
+```
+
+#### 2. TSS Enrichment (TSSE)
+
+**Purpose:** Quantify enrichment of accessible fragments near transcription start sites.  
+- **High TSSE (≥ 5–10):** Strong promoter accessibility, good quality  
+- **Low TSSE (< 4):** Poor signal, low complexity, or over-digestion
+
+**Example:**
+```python
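+# gene_anno is required: a genome annotation such as snap.genome.hg38 or snap.genome.mm10
+snap.metrics.tsse(data, gene_anno)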
+```
+
+#### 3. FRiP — Fraction of Reads in Peaks
+
+**Purpose:** Quantify the share of fragments falling inside called peaks; higher FRiP means cleaner regulatory signal (≈0.2 good, <0.1 noisy).
+
+**Example:**
+```python
+snap.metrics.frip(adata, regions, inplace=True, n_jobs=8)
+```
+
+**Inputs:**  
+- `adata`: AnnData or list of AnnData objects to annotate; writes scores to `adata.obs` when `inplace=True`.  
+- `regions`: dict mapping peak-set names to BED paths or genomic interval lists.  
+- `n_jobs`: parallel workers (use `-1` for all cores).
+
+**Note:** Run `snap.pp.import_data(...)` (named `snap.pp.import_fragments` in newer SnapATAC2 releases) beforehand to load fragment data.
+
+#### 4. Nucleosome Signal
+
+**Purpose:** Ratio of mono/di-nucleosomal to short fragments.  
+- **Low (< 2):** Good chromatin accessibility  
+- **High (> 4):** Over-digested or low-quality libraries
+
+#### 5. Number of Fragments per Cell (`adata.obs["n_fragment"]`)
+
+**Purpose:** Assess sequencing depth and data sparsity per cell/barcode.  
+- **Low fragments (< 1 k):** Dropouts or ambient noise  
+- **Extremely high:** Doublets or multiplets
+
+#### 6. Mitochondrial Read Fraction (`adata.obs["frac_mito"]`)
+
+**Purpose:** Detect low-quality or dying cells with excessive mitochondrial reads.  
+- **High (> 10 %):** Possible cell stress or broken nuclei
+
+---
+
+### **Batch Correction (SnapATAC2)**
+
+If `adata.obs['sample']` contains more than one sample, run a batch-correction pass before clustering:
+
+1. After QC, call `snap.pp.mnc_correct(adatas, batch='sample')` to apply the modified mutual-nearest-neighbour correction.
+2. Optionally refine with `snap.pp.harmony(adatas, batch='sample', max_iter_harmony=20)` to harmonise global structure.
+3. Recompute embeddings (e.g., `snap.tl.spectral`, `snap.tl.umap`) using the corrected representation, and proceed with clustering.
+
+---
 
 ### **Clustering (workflow only)**: 
 - Use `w_workflow` to launch the AtlasXOmics clustering workflow.
@@ -35,6 +125,7 @@ The section below defines detailed guidelines for each of the above steps.
 ### Clustering Workflow Parameters
 
 - Strictly follow <workflow_rules>.
+- Because clustering outcomes are sensitive to input settings, always pass multiple values for `n_features`, `resolution`, and `n_comps` by default so users can compare results and select the most meaningful clustering after the workflow finishes running.
 
 #### **Required**
 - `project_name` *(str)*  
@@ -58,16 +149,16 @@ The section below defines detailed guidelines for each of the above steps.
   Genomic bin size (default: `5000`)
 
 - `n_features` *(List[int])*  
-  Top accessible tiles to use, e.g., `[25000]`
+  Top accessible tiles to use, e.g., `[25000, 50000, 100000]`
 
 - `resolution` *(List[float])*  
-  Clustering resolution, e.g., `[1.0]`
+  Clustering resolution, e.g., `[0.5, 1.0, 1.25, 1.5]`
 
 - `varfeat_iters` *(List[int])*  
   Iterations for variable feature selection, e.g., `[1]`
 
 - `n_comps` *(List[int])*  
-  Dimensionality reduction components, e.g., `[30]`
+  Dimensionality reduction components, e.g., `[30, 50]`
 
 - `min_cluster_size` *(int)*  
   Minimum cells per cluster
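
To gauge how many clustering runs the example lists above imply, the sketch below enumerates every combination of the list-valued parameters. It assumes the workflow runs one parameter set per Cartesian combination, which this document does not state explicitly; the variable names are illustrative.

```python
from itertools import product

# Example values taken from the parameter descriptions above.
param_lists = {
    "n_features": [25000, 50000, 100000],
    "resolution": [0.5, 1.0, 1.25, 1.5],
    "varfeat_iters": [1],
    "n_comps": [30, 50],
}

# One dict per combination, in the order product() yields them.
combos = [dict(zip(param_lists, values)) for values in product(*param_lists.values())]

print(len(combos))  # 3 * 4 * 1 * 2 = 24 combinations
print(combos[0])
```

Under that assumption, wider lists grow the number of runs multiplicatively, so it may be worth keeping each list short.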
@@ -183,21 +274,6 @@ w = w_workflow(
 execution = w.value
 ```
 
-#### Example Implementation
-```python
-viewer = w_h5(ann_data=adata)
-value = viewer.value
-
-# Get the current selections
-if value['lasso_points']:
-    print(f"User selected {len(value['lasso_points'])} regions")
-    print(f"The embedding used for the lasso selection is {value['lasso_points_obsm']}")
-    for i, region in enumerate(value['lasso_points']):
-        print(f"Region {i}: {len(region)} points")
-
-# Proceed to create an `adata_subset` based on lasso-selected points
-```
-
 ### **Differential Gene Activity or Motif Enrichment Comparison (workflow only)**
 - Use `w_workflow` to launch the AtlasXOmics comparison workflow.  
 - Automatically infer the correct grouping column from `adata.obs` (`condition`, `sample`, or `cluster`).  
@@ -248,13 +324,7 @@ execution = w.value
 #### How to construct `compare_config.json`
 To run a comparison workflow, you must generate a `compare_config.json` file that defines which cells belong to each group.
 
-What to Ask the User:
-- Comparison Target: Ask the user what biological groups or conditions they want to compare. For example:
-    - "Diseased vs Healthy"
-    - "Sample A vs Sample B"
-    - "Cluster 5 vs Cluster 7"
-
-How to Identify Groups:
+- Comparison Target: Ask the user what biological groups or conditions they want to compare.
 - You must infer which column in adata.obs encodes this grouping (e.g. "condition", "sample", or "cluster").
 - Then, for each group (A and B), filter all cell barcodes in `adata.obs.index` that match the selected value.
 
@@ -287,23 +357,38 @@ latch_path = LPath.upload(Path(local_path), remote_path)
 groupings_file = LatchFile(remote_path)
 ```
 
-### **Cell Type Annotation**
+## Data Assumptions
 
-- If the dataset context is unclear, first ask the user to confirm the **organism** and **tissue type**. **Do NOT proceed** until the user has answered the question. 
-- **Always render a form with sensible defaults** to avoid tedious manual input. The form should support **multiple candidate cell types**, e.g. one row per cell type:
-  - `cell_type`: **text input**, pre-filled with a common or inferred cell type.
-  - `marker_genes`: **multiselect widget**, pre-populated with default marker genes for that cell type.
-- You **must auto-populate all fields with reasonable defaults using domain knowledge**. Users should only adjust values if needed, not enter them from scratch.
-- Add a **button** after the form to trigger gene set scoring. 
-- For **spatial ATAC-seq** only, infer cell identity by computing **gene activity or gene set scores** (e.g., `scanpy.tl.score_genes`) and ranking cell types based on marker enrichment.
+AtlasXOmics datasets have the following data conventions. Assume this structure exists without asking the user to confirm.
 
-## Data Assumptions
+### Raw Data
+Raw data consists of fragment files (`fragments.tsv.gz`) and `spatial/` folders, which contain images, image metadata, and barcode-image mappings stored as CSV files. Every experiment, designated by a unique 'Run ID' of the form Dxxxxx (where x is a digit), is associated with distinct raw data. Fragment files and spatial folders are stored on different file paths in Latch Data.
 
-AtlasXOmics datasets follow a standard output folder schema across all projects.
-Assume this structure exists without asking the user to confirm.
+In the default AtlasXomics Workspace (13502), fragment files are stored in the path `/chromap_outs/[Run_ID]/chromap_output/fragments.tsv.gz`.  Spatial folders are stored in the path `/Images_spatial/[Run_ID]/spatial`. 
 
-### Example Folder Layout
-root/
+In collaborator Workspaces (not 13502), fragment files and spatial folders are stored together in the parent directory corresponding to the Run ID. Fragment files are stored at `.../Raw_Data/[Run_ID]/chromap_output/fragments.tsv.gz`, the BED file at `.../Raw_Data/[Run_ID]/chromap_output/aln.bed`, and spatial folders at `.../Raw_Data/[Run_ID]/spatial`.
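
The two raw-data layouts can be captured in a small path helper. This is only a sketch: `raw_data_paths`, `RawDataPaths`, and the `collaborator_root` argument are hypothetical names, and because the leading `...` of the collaborator paths is not specified here, the caller has to supply that prefix.

```python
from dataclasses import dataclass

DEFAULT_WORKSPACE_ID = "13502"  # default AtlasXomics Workspace


@dataclass(frozen=True)
class RawDataPaths:
    fragments: str
    spatial: str


def raw_data_paths(run_id: str, workspace_id, collaborator_root: str = "") -> RawDataPaths:
    """Build the fragment-file and spatial-folder paths for one Run ID."""
    if str(workspace_id) == DEFAULT_WORKSPACE_ID:
        # Default Workspace: fragments and spatial folders live on separate paths.
        return RawDataPaths(
            fragments=f"/chromap_outs/{run_id}/chromap_output/fragments.tsv.gz",
            spatial=f"/Images_spatial/{run_id}/spatial",
        )
    # Collaborator Workspaces: both live under <prefix>/Raw_Data/<Run ID>.
    base = f"{collaborator_root.rstrip('/')}/Raw_Data/{run_id}"
    return RawDataPaths(
        fragments=f"{base}/chromap_output/fragments.tsv.gz",
        spatial=f"{base}/spatial",
    )
```

For example, `raw_data_paths("D01234", 13502).spatial` gives `/Images_spatial/D01234/spatial`, where `D01234` is a made-up Run ID.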
+
+### Workflow Outputs
+
+#### Clustering Workflow (optimize_snap)
+The clustering workflow (optimize_snap, wf.__init__.opt_workflow) stores outputs on the path `/snap_opts/[project name]/` where "project name" is designated by the user in the Workflow input parameters. The output directory has the following structure:
+[project name]/
+├── figures/
+├── medians.csv
+├── set1_ts5000-vf500000-cr1-0-vi1-nc30
+├── set2_ts5000-vf500000-cr1-0-vi1-nc40
+
+Each folder with the prefix 'setN' corresponds to a combination of input clustering parameters.  Each contains a 'combined.h5ad' file which stores the AnnData object generated with the specified clustering parameters, with .X as a tile matrix.
+
+The figures/ directory contains plots saved as .pdf files for QC and to guide selection of cluster parameters for downstream analysis. The medians.csv contains QC metrics for the project.
+
+#### atx_snap and create ArchRProject Workflows
+
+These two Workflows create files to be analyzed in Plots.  A Workflow takes as input one or multiple Run IDs and corresponding raw data.  It creates the outputs detailed below.  The `combined_sm_ge.h5ad` and `combined_sm_motifs.h5ad` files are AnnData objects with .X as gene accessibility data and motif enrichment data, respectively.  Each object contains data for all Run IDs specified in the inputs. Runs are designated by the column 'sample' in AnnData .obs.
+
+In the default AtlasXomics Workspace (13502), results are stored in `/snap_outs/[project name]` where project name is specified in the Execution inputs.  In customer Workspaces, data is stored in `.../Processed_Data/[project name]`.
+
+project_name/
 ├── cluster_coverages/
 ├── condition_coverages/
 ├── figures/
@@ -328,22 +413,12 @@ root/
 ├── D02310_NG07345_m_converted.h5ad
 ├── D02310_NG07345_SeuratObjMotif.rds
 ├── D02310_NG07345_SeuratObj.rds
-│
-├── enrichMotifs_clusters.rds
-├── enrichMotifs_condition_1.rds
-├── enrichMotifs_sample.rds
-│
-├── markersGS_clusters.rds
-├── markersGS_condition_1.rds
-├── markersGS_sample.rds
-
 
 **Important files to pay attention to**
-- `combined_sm_ge.h5ad`: Stores gene activity score for every spot
-- `combined_sm_motifs.h5ad`: Stores motif enrichment score for every spot
+- `combined_sm_ge.h5ad`: Stores gene activity score for every spot for all Run IDs
+- `combined_sm_motifs.h5ad`: Stores motif enrichment score for every spot for all Run IDs
 - Both files follow the standard AnnData structure with `obs`, `uns`, `obsm`, and `obsp` components as detailed below. 
-- Folder ending with `_ArchRProject`: An `ArchR` project
-- Folder ending with `_coverages`: Contains `.bw` files which can be visualized in the IGV browser.
+- Folder ending with `_ArchRProject`: An `ArchR` project 
 
 **Standard Fields inside an AnnData Object**
 Example structure of `combined_sm_ge.h5ad`