
Conversation

dltamayo (Collaborator)

Moving towards de-hardcoding of metadata fields

dltamayo and others added 5 commits October 2, 2025 11:19
- switched to using metadata in JSON format instead of hard-coded list access
  - metadata access itself is still hard-coded for now
- switched to pandas instead of csv for writing output
- removed old code

The eventual goal is for any combination of clinical metadata at the patient/sample level to be usable for clustering/grouping.
Headers will now be included in the aggregated .csv files instead of being hard-coded within the notebook.
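
As a rough sketch of the direction described above (groovy.json.JsonOutput is standard Groovy; the field names are taken from the test outputs quoted in the review below, and the exact serialization in sample_calc.nf may differ), per-sample metadata can be carried as a map and emitted as JSON:

import groovy.json.JsonOutput

// hypothetical per-sample metadata map; keys mirror the renamed fields
def meta = [sample: 's1', subject_id: 'p01', timepoint: 'baseline', origin: 'PBMC']
def meta_json = JsonOutput.toJson(meta)
// => {"sample":"s1","subject_id":"p01","timepoint":"baseline","origin":"PBMC"}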
@dltamayo dltamayo requested a review from dimalvovs October 14, 2025 19:06
@dimalvovs dimalvovs requested a review from Copilot October 14, 2025 19:14

Copilot AI left a comment


Pull Request Overview

Refactors the sample calculation workflow to produce per-sample CSVs with richer metadata, then aggregates them for plotting, moving away from hard-coded column definitions. Introduces a new SAMPLE_AGGREGATE process and updates the downstream notebook to align with the renamed metadata fields; also adds dynamic CPU allocation to TCRDIST3.

  • Introduce per-sample output file naming (stats/, vdj/) and new aggregation step replacing prior collectFile usage.
  • Update analysis notebook to rely on CSV headers instead of hard-coded column name lists and rename metadata fields (patient_id -> subject_id, sample_id -> sample).
  • Add dynamic CPU allocation logic to TCRDIST3 process.

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.

Summary per file:

  • subworkflows/local/sample.nf: Replaces collectFile outputs with collection + aggregation processes; updates inputs to the plotting step.
  • notebooks/sample_stats_template.qmd: Changes the CSV reading strategy and metadata field names; adjusts plotting hover fields and grouping.
  • modules/local/sample/tcrdist3.nf: Adds a dynamic cpus directive based on task.memory.
  • modules/local/sample/sample_calc.nf: Changes output file paths/name patterns and serializes sample metadata as JSON.
  • modules/local/sample/sample_aggregate.nf: New process to concatenate multiple per-sample CSVs into aggregated files.


Comment on lines +6 to +14
if (task.memory > 256.GB)
    return 16 * task.attempt
else if (task.memory > 64.GB)
    return 8 * task.attempt
else if (task.memory > 4.GB)
    return 4 * task.attempt
else
    return 2 * task.attempt
}

Copilot AI Oct 14, 2025


The cpus directive depends on task.memory, which itself is derived later by the memory block; this circular dependency can result in task.memory being undefined or default when cpus is evaluated. Derive both cpu and memory from the same underlying metric (e.g., count_table.size()) or compute memory first in a variable and base cpus on that variable instead of task.memory.

Suggested change
-if (task.memory > 256.GB)
-    return 16 * task.attempt
-else if (task.memory > 64.GB)
-    return 8 * task.attempt
-else if (task.memory > 4.GB)
-    return 4 * task.attempt
-else
-    return 2 * task.attempt
-}
+def sz = count_table.size()
+def mb = 1024 * 1024
+if (sz > 26 * mb)
+    return 16 * task.attempt
+else if (sz > 20 * mb)
+    return 8 * task.attempt
+else if (sz > 10 * mb)
+    return 4 * task.attempt
+else
+    return 2 * task.attempt
+}

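For reference, a minimal sketch of the non-circular pattern described in the comment, deriving both directives from the same input metric; the helper name and all thresholds/ratios here are illustrative, not values from the PR:

def cpusForSize(sz) {
    // illustrative byte thresholds, not tuned values
    def mb = 1024 * 1024
    if (sz > 26 * mb)
        return 16
    else if (sz > 20 * mb)
        return 8
    else if (sz > 10 * mb)
        return 4
    else
        return 2
}

process TCRDIST3 {
    // both directives key off count_table.size(), so neither depends on the other
    cpus   { cpusForSize(count_table.size()) * task.attempt }
    memory { 4.GB * cpusForSize(count_table.size()) * task.attempt }

    input:
    path count_table

    // ...
}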

Comment on lines +37 to +45
+SAMPLE_CALC.out.sample_csv.collect().set { sample_csv_files }
+SAMPLE_CALC.out.v_family_csv.collect().set { v_family_csv_files }
+SAMPLE_CALC.out.d_family_csv.collect().set { d_family_csv_files }
+SAMPLE_CALC.out.j_family_csv.collect().set { j_family_csv_files }

-SAMPLE_CALC.out.v_family_csv
-    .collectFile(name: 'v_family.csv', sort: true,
-        storeDir: "${params.outdir}/sample")
-    .set { v_family_csv }
-
-SAMPLE_CALC.out.d_family_csv
-    .collectFile(name: 'd_family.csv', sort: true,
-        storeDir: "${params.outdir}/sample")
-    .set { d_family_csv }
-
-SAMPLE_CALC.out.j_family_csv
-    .collectFile(name: 'j_family.csv', sort: true,
-        storeDir: "${params.outdir}/sample")
-    .set { j_family_csv }
+SAMPLE_AGG_STAT(sample_csv_files, "sample_stats.csv")
+SAMPLE_AGG_V(v_family_csv_files, "v_family.csv")
+SAMPLE_AGG_D(d_family_csv_files, "d_family.csv")
+SAMPLE_AGG_J(j_family_csv_files, "j_family.csv")

Copilot AI Oct 14, 2025


Previously, collectFile used sort: true and storeDir to produce deterministic, published aggregated CSVs; replacing it with collect + SAMPLE_AGGREGATE removes both the sorting and the publication to params.outdir, potentially changing downstream reproducibility and final output availability. Reintroduce explicit sorting (e.g., sort by path or sample id before concatenation) and add publishDir/storeDir on the aggregation processes to preserve the prior behavior.
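
A minimal sketch of how that could look under the new design (channel and process names follow the PR; the toSortedList comparator and publishDir settings are assumptions):

SAMPLE_CALC.out.v_family_csv
    .toSortedList { a, b -> a.name <=> b.name }  // deterministic order regardless of task completion order
    .set { v_family_csv_files }

process SAMPLE_AGGREGATE {
    // publish the aggregated CSV where collectFile's storeDir used to place it
    publishDir "${params.outdir}/sample", mode: 'copy'
    // ...
}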


Collaborator

It's hard to tell whether a fixed order is important. Is it, @dltamayo? Also, the same comment as above about aggregating with collectFile as opposed to Python applies here.

Comment on lines +14 to +23
"""
python3 <<EOF
import pandas as pd
input_files = [${csv_files.collect { '"' + it.getName() + '"' }.join(', ')}]
dfs = [pd.read_csv(f) for f in input_files]
merged = pd.concat(dfs, axis=0, ignore_index=True)
merged.to_csv("${output_file}", index=False)
EOF
"""

Copilot AI Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aggregation does not enforce a deterministic ordering of rows (prior implementation used sort:true), so run-to-run concatenation order may vary depending on file staging order. Apply a consistent sort (e.g., sort input_files list or sort merged by key columns) before writing to ensure reproducible outputs.
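
A minimal sketch of the deterministic variant of the script block above, changing only the file list (everything else as in the PR):

"""
python3 <<EOF
import pandas as pd
# sort staged file names so concatenation does not depend on staging order
input_files = sorted([${csv_files.collect { '"' + it.getName() + '"' }.join(', ')}])
dfs = [pd.read_csv(f) for f in input_files]
merged = pd.concat(dfs, axis=0, ignore_index=True)
merged.to_csv("${output_file}", index=False)
EOF
"""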


Collaborator

Not sure whether concatenation order is important, as Copilot states above, but is there a reason we need SAMPLE_AGGREGATE as a separate Python-based process, as opposed to the seemingly simpler collectFile operator?
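
For comparison, collectFile can keep a single header row while concatenating via its keepHeader/skip options, which would cover the header concern without a separate process (channel and file names follow the PR):

SAMPLE_CALC.out.sample_csv
    .collectFile(name: 'sample_stats.csv', sort: true,
                 keepHeader: true, skip: 1,
                 storeDir: "${params.outdir}/sample")
    .set { sample_stats_csv }

One caveat: keepHeader assumes every per-sample file shares the same header, whereas pd.concat aligns differing column sets, which may be the reason for the Python-based process (see the reply about per-sample columns further down).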

-    'num_clones', 'num_TCRs', 'simpson_index', 'simpson_index_corrected', 'clonality',
-    'num_prod', 'num_nonprod', 'pct_prod', 'pct_nonprod',
-    'productive_cdr3_avg_len', 'num_convergent', 'ratio_convergent'])
+df = pd.read_csv(sample_stats_csv, sep=',')

Copilot AI Oct 14, 2025


Previously the file was read with header=None and an explicit column name list; removing that implies the CSV now contains a header row, but upstream changes do not show evidence that a header has been added. If the file still lacks a header, the first data row will be misinterpreted as column names—retain header=None with explicit names or confirm the writer now emits headers.

Suggested change
-df = pd.read_csv(sample_stats_csv, sep=',')
+# Replace the column names below with the actual column names for sample_stats_csv
+df = pd.read_csv(sample_stats_csv, sep=',', header=None, names=['col1', 'col2', 'col3', 'col4'])


dimalvovs (Collaborator) Oct 20, 2025

These headers are now present in the sample_stats.csv (output of the minimal test run), so I think this is not an issue.

sample,subject_id,timepoint,origin,num_clones,num_TCRs,simpson_index,simpson_index_corrected,clonality,num_prod,num_nonprod,pct_prod,pct_nonprod,productive_cdr3_avg_len,num_convergent,ratio_convergent

-    'TCRBV22', 'TCRBV23', 'TCRBV24', 'TCRBV25', 'TCRBV26',
-    'TCRBV27', 'TCRBV28', 'TCRBV29', 'TCRBV30'])
-v_family = v_family.sort_values(by=['patient_id', 'timepoint'])
+v_family = pd.read_csv(v_family_csv, sep=',')

Copilot AI Oct 14, 2025


The earlier version explicitly supplied the V gene column names with header=None; switching to default header handling assumes the aggregated v_family CSV now has a header row. If it does not, sorting by 'subject_id' will raise a KeyError because that column name will instead be a data value—either restore header=None with explicit names or ensure headers are written upstream.

Suggested change
-v_family = pd.read_csv(v_family_csv, sep=',')
+v_family = pd.read_csv(v_family_csv, sep=',', header=None, index_col=None,
+    names=['sample', 'file', 'subject_id', 'timepoint', 'origin', 'v_family', 'proportion'])


Collaborator

The individual stats tables contain the columns that were present for each sample, so I think this is not an issue but rather a feature: column names are not hardcoded anymore.
Example:

subject_id,timepoint,origin,TRBD1,TRBD2
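
A tiny check of the pd.concat behavior this relies on (illustrative values):

python3 <<EOF
import pandas as pd
# two samples with different gene columns, as in the example header above
a = pd.DataFrame({'subject_id': ['s1'], 'TRBD1': [0.6]})
b = pd.DataFrame({'subject_id': ['s2'], 'TRBD2': [0.4]})
merged = pd.concat([a, b], axis=0, ignore_index=True)
print(list(merged.columns))  # ['subject_id', 'TRBD1', 'TRBD2']; cells absent in a sample become NaN
EOF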
