Reformat sample_calc #60
base: main
Conversation
- using metadata in JSON format instead of hard-coded list access (still hard-coding metadata access, though)
- switched to pandas for outputting instead of csv
- removed old code

The eventual goal is to have any combination of clinical metadata at the patient/sample level be used for clustering/grouping.
Headers will now be included in the aggregated .csvs, instead of being hard-coded within the notebook.
Pull Request Overview
Refactors the sample calculation workflow to produce per-sample CSVs with richer metadata, then aggregates them for plotting, moving away from hard-coded column definitions. Introduces a new SAMPLE_AGGREGATE process, updates the downstream notebook to align with the renamed metadata fields, and adds dynamic CPU allocation to TCRDIST3.
- Introduce per-sample output file naming (stats/, vdj/) and a new aggregation step replacing the prior collectFile usage.
- Update analysis notebook to rely on CSV headers instead of hard-coded column name lists and rename metadata fields (patient_id -> subject_id, sample_id -> sample).
- Add dynamic CPU allocation logic to TCRDIST3 process.
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| subworkflows/local/sample.nf | Replaces collectFile outputs with collection + aggregation processes; updates inputs to plotting step. |
| notebooks/sample_stats_template.qmd | Changes CSV reading strategy and metadata field names; adjusts plotting hover fields and grouping. |
| modules/local/sample/tcrdist3.nf | Adds dynamic cpus directive based on task.memory. |
| modules/local/sample/sample_calc.nf | Changes output file paths/name patterns and serializes sample metadata as JSON. |
| modules/local/sample/sample_aggregate.nf | New process to concatenate multiple per-sample CSVs into aggregated files. |
```groovy
if (task.memory > 256.GB)
    return 16 * task.attempt
else if (task.memory > 64.GB)
    return 8 * task.attempt
else if (task.memory > 4.GB)
    return 4 * task.attempt
else
    return 2 * task.attempt
}
```
Copilot AI (Oct 14, 2025):
The cpus directive depends on task.memory, which is itself derived later by the memory block; this circular dependency can result in task.memory being undefined or at its default when cpus is evaluated. Derive both cpus and memory from the same underlying metric (e.g., count_table.size()), or compute memory first in a variable and base cpus on that variable instead of task.memory.
Suggested change:
```diff
-if (task.memory > 256.GB)
-    return 16 * task.attempt
-else if (task.memory > 64.GB)
-    return 8 * task.attempt
-else if (task.memory > 4.GB)
-    return 4 * task.attempt
-else
-    return 2 * task.attempt
-}
+def sz = count_table.size()
+def mb = 1024 * 1024
+if (sz > 26 * mb)
+    return 16 * task.attempt
+else if (sz > 20 * mb)
+    return 8 * task.attempt
+else if (sz > 10 * mb)
+    return 4 * task.attempt
+else
+    return 2 * task.attempt
+}
```
```diff
+SAMPLE_CALC.out.sample_csv.collect().set { sample_csv_files }
+SAMPLE_CALC.out.v_family_csv.collect().set { v_family_csv_files }
+SAMPLE_CALC.out.d_family_csv.collect().set { d_family_csv_files }
+SAMPLE_CALC.out.j_family_csv.collect().set { j_family_csv_files }
-SAMPLE_CALC.out.v_family_csv
-    .collectFile(name: 'v_family.csv', sort: true,
-        storeDir: "${params.outdir}/sample")
-    .set { v_family_csv }
-SAMPLE_CALC.out.d_family_csv
-    .collectFile(name: 'd_family.csv', sort: true,
-        storeDir: "${params.outdir}/sample")
-    .set { d_family_csv }
-SAMPLE_CALC.out.j_family_csv
-    .collectFile(name: 'j_family.csv', sort: true,
-        storeDir: "${params.outdir}/sample")
-    .set { j_family_csv }
+SAMPLE_AGG_STAT(sample_csv_files, "sample_stats.csv")
+SAMPLE_AGG_V(v_family_csv_files, "v_family.csv")
+SAMPLE_AGG_D(d_family_csv_files, "d_family.csv")
+SAMPLE_AGG_J(j_family_csv_files, "j_family.csv")
```
Copilot AI (Oct 14, 2025):
Previously collectFile used sort:true and storeDir to produce deterministic, published aggregated CSVs; replacing it with collect + SAMPLE_AGGREGATE removes sorting and output publication to params.outdir, potentially changing downstream reproducibility and final output availability. Reintroduce explicit sorting (e.g., sort by path or sample id before concatenation) and add publishDir/storeDir on aggregation processes to preserve prior behavior.
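A minimal sketch of how channel-level sorting could be reintroduced, assuming the channel and process names from this PR; a process-level fix with publishDir is sketched in the thread further below.

```groovy
// Sketch: sort the collected per-sample CSVs by file name before aggregation,
// so Nextflow's staging order cannot change the concatenation order.
SAMPLE_CALC.out.sample_csv
    .collect()
    .map { files -> files.sort { it.name } }   // deterministic, name-based order
    .set { sample_csv_files }

SAMPLE_AGG_STAT(sample_csv_files, "sample_stats.csv")
```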
It's hard to tell whether a fixed order is important; is it, @dltamayo? Also, the same comment as above about aggregating with collectFile as opposed to Python applies here.
""" | ||
python3 <<EOF | ||
import pandas as pd | ||
input_files = [${csv_files.collect { '"' + it.getName() + '"' }.join(', ')}] | ||
dfs = [pd.read_csv(f) for f in input_files] | ||
merged = pd.concat(dfs, axis=0, ignore_index=True) | ||
merged.to_csv("${output_file}", index=False) | ||
EOF | ||
""" |
Copilot AI (Oct 14, 2025):
Aggregation does not enforce a deterministic ordering of rows (prior implementation used sort:true), so run-to-run concatenation order may vary depending on file staging order. Apply a consistent sort (e.g., sort input_files list or sort merged by key columns) before writing to ensure reproducible outputs.
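A minimal sketch of the same process with deterministic ordering and output publication restored, assuming the inputs shown in the diff above; the publish path mirrors the old storeDir and is otherwise an assumption.

```groovy
process SAMPLE_AGGREGATE {
    // Assumed publish path, mirroring the old collectFile storeDir behavior.
    publishDir "${params.outdir}/sample", mode: 'copy'

    input:
    path csv_files
    val output_file

    output:
    path output_file

    script:
    """
    python3 <<EOF
import pandas as pd
# sorted() fixes the concatenation order regardless of staging order
input_files = sorted([${csv_files.collect { '"' + it.getName() + '"' }.join(', ')}])
dfs = [pd.read_csv(f) for f in input_files]
merged = pd.concat(dfs, axis=0, ignore_index=True)
merged.to_csv("${output_file}", index=False)
EOF
    """
}
```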
Not sure if concatenation order is important, as Copilot states above, but is there a reason why we need SAMPLE_AGGREGATE as a separate Python-based process as opposed to the seemingly simpler collectFile operator?
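For reference, a hedged sketch of the collectFile alternative this comment alludes to; keepHeader/skip would handle the per-file header rows the new CSVs carry, and sort/storeDir restore the previous determinism and publication (channel and file names as in this PR):

```groovy
// keepHeader: true keeps the header row from the first file;
// skip: 1 drops it from every subsequent file.
SAMPLE_CALC.out.sample_csv
    .collectFile(name: 'sample_stats.csv', keepHeader: true, skip: 1,
                 sort: true, storeDir: "${params.outdir}/sample")
    .set { sample_stats_csv }
```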
```diff
-                 'num_clones', 'num_TCRs', 'simpson_index', 'simpson_index_corrected', 'clonality',
-                 'num_prod', 'num_nonprod', 'pct_prod', 'pct_nonprod',
-                 'productive_cdr3_avg_len', 'num_convergent', 'ratio_convergent'])
+df = pd.read_csv(sample_stats_csv, sep=',')
```
Copilot AI (Oct 14, 2025):
Previously the file was read with header=None and an explicit column name list; removing that implies the CSV now contains a header row, but upstream changes do not show evidence that a header has been added. If the file still lacks a header, the first data row will be misinterpreted as column names—retain header=None with explicit names or confirm the writer now emits headers.
Suggested change:
```diff
-df = pd.read_csv(sample_stats_csv, sep=',')
+# Replace the column names below with the actual column names for sample_stats_csv
+df = pd.read_csv(sample_stats_csv, sep=',', header=None, names=['col1', 'col2', 'col3', 'col4'])
```
These headers are now present in sample_stats.csv (output of the minimal test run), so I think this is not an issue:

```
sample,subject_id,timepoint,origin,num_clones,num_TCRs,simpson_index,simpson_index_corrected,clonality,num_prod,num_nonprod,pct_prod,pct_nonprod,productive_cdr3_avg_len,num_convergent,ratio_convergent
```
```diff
-           'TCRBV22', 'TCRBV23', 'TCRBV24', 'TCRBV25', 'TCRBV26',
-           'TCRBV27', 'TCRBV28', 'TCRBV29', 'TCRBV30'])
-v_family = v_family.sort_values(by=['patient_id', 'timepoint'])
+v_family = pd.read_csv(v_family_csv, sep=',')
```
Copilot AI (Oct 14, 2025):
The earlier version explicitly supplied the V gene column names with header=None; switching to default header handling assumes the aggregated v_family CSV now has a header row. If it does not, sorting by 'subject_id' will raise a KeyError because that column name will instead be a data value—either restore header=None with explicit names or ensure headers are written upstream.
Suggested change:
```diff
-v_family = pd.read_csv(v_family_csv, sep=',')
+v_family = pd.read_csv(v_family_csv, sep=',', header=None, index_col=None,
+                       names=['sample', 'file', 'subject_id', 'timepoint', 'origin', 'v_family', 'proportion'])
```
The individual stats tables contain the columns that were present for each sample, so I think this is not an issue but rather a feature: column names are not hard-coded anymore. Example:

```
subject_id,timepoint,origin,TRBD1,TRBD2
```
Moving towards removing hard-coded metadata fields.