
Conversation

dltamayo (Collaborator)

Moving towards de-hardcoding of metadata fields

dltamayo and others added 5 commits October 2, 2025 11:19
- switched to using metadata in JSON format instead of hard-coded list access
  - metadata access itself is still hard-coded for now
- switched to pandas instead of csv for writing output
- removed old code

The eventual goal is for any combination of clinical metadata at the patient/sample level to be usable for clustering/grouping.
Headers will now be included in the aggregated .csv files instead of being hard-coded within the notebook.
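
As a rough sketch of the direction described above (groovy.json.JsonOutput is standard Groovy; the field names are taken from the test outputs quoted in the review below, and the exact serialization in sample_calc.nf may differ), per-sample metadata can be carried as a map and emitted as JSON:

import groovy.json.JsonOutput

// hypothetical per-sample metadata map; keys mirror the renamed fields
def meta = [sample: 's1', subject_id: 'p01', timepoint: 'baseline', origin: 'PBMC']
def meta_json = JsonOutput.toJson(meta)
// => {"sample":"s1","subject_id":"p01","timepoint":"baseline","origin":"PBMC"}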
@dltamayo dltamayo requested a review from dimalvovs October 14, 2025 19:06
@dimalvovs dimalvovs requested a review from Copilot October 14, 2025 19:14

Copilot AI left a comment


Pull Request Overview

Refactors the sample calculation workflow to produce per-sample CSVs with richer metadata, then aggregates them for plotting, moving away from hard-coded column definitions. Introduces a new SAMPLE_AGGREGATE process and updates the downstream notebook to align with the renamed metadata fields; also adds dynamic CPU allocation to TCRDIST3.

  • Introduce per-sample output file naming (stats/, vdj/) and new aggregation step replacing prior collectFile usage.
  • Update analysis notebook to rely on CSV headers instead of hard-coded column name lists and rename metadata fields (patient_id -> subject_id, sample_id -> sample).
  • Add dynamic CPU allocation logic to TCRDIST3 process.

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.

Summary per file:

  • subworkflows/local/sample.nf: Replaces collectFile outputs with collection + aggregation processes; updates inputs to the plotting step.
  • notebooks/sample_stats_template.qmd: Changes the CSV reading strategy and metadata field names; adjusts plotting hover fields and grouping.
  • modules/local/sample/tcrdist3.nf: Adds a dynamic cpus directive based on task.memory.
  • modules/local/sample/sample_calc.nf: Changes output file paths/name patterns and serializes sample metadata as JSON.
  • modules/local/sample/sample_aggregate.nf: New process to concatenate multiple per-sample CSVs into aggregated files.


Comment on lines +6 to +14
if (task.memory > 256.GB)
    return 16 * task.attempt
else if (task.memory > 64.GB)
    return 8 * task.attempt
else if (task.memory > 4.GB)
    return 4 * task.attempt
else
    return 2 * task.attempt
}

Copilot AI Oct 14, 2025


The cpus directive depends on task.memory, which itself is derived later by the memory block; this circular dependency can result in task.memory being undefined or default when cpus is evaluated. Derive both cpu and memory from the same underlying metric (e.g., count_table.size()) or compute memory first in a variable and base cpus on that variable instead of task.memory.

Suggested change
-if (task.memory > 256.GB)
-    return 16 * task.attempt
-else if (task.memory > 64.GB)
-    return 8 * task.attempt
-else if (task.memory > 4.GB)
-    return 4 * task.attempt
-else
-    return 2 * task.attempt
-}
+def sz = count_table.size()
+def mb = 1024 * 1024
+if (sz > 26 * mb)
+    return 16 * task.attempt
+else if (sz > 20 * mb)
+    return 8 * task.attempt
+else if (sz > 10 * mb)
+    return 4 * task.attempt
+else
+    return 2 * task.attempt
+}

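For reference, a minimal sketch of the non-circular pattern described in the comment, deriving both directives from the same input metric; the helper name and all thresholds/ratios here are illustrative, not values from the PR:

def cpusForSize(sz) {
    // illustrative byte thresholds, not tuned values
    def mb = 1024 * 1024
    if (sz > 26 * mb)
        return 16
    else if (sz > 20 * mb)
        return 8
    else if (sz > 10 * mb)
        return 4
    else
        return 2
}

process TCRDIST3 {
    // both directives key off count_table.size(), so neither depends on the other
    cpus   { cpusForSize(count_table.size()) * task.attempt }
    memory { 4.GB * cpusForSize(count_table.size()) * task.attempt }

    input:
    path count_table

    // ...
}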

Comment on lines +37 to +45
+SAMPLE_CALC.out.sample_csv.collect().set { sample_csv_files }
+SAMPLE_CALC.out.v_family_csv.collect().set { v_family_csv_files }
+SAMPLE_CALC.out.d_family_csv.collect().set { d_family_csv_files }
+SAMPLE_CALC.out.j_family_csv.collect().set { j_family_csv_files }

-SAMPLE_CALC.out.v_family_csv
-    .collectFile(name: 'v_family.csv', sort: true,
-        storeDir: "${params.outdir}/sample")
-    .set { v_family_csv }
-
-SAMPLE_CALC.out.d_family_csv
-    .collectFile(name: 'd_family.csv', sort: true,
-        storeDir: "${params.outdir}/sample")
-    .set { d_family_csv }
-
-SAMPLE_CALC.out.j_family_csv
-    .collectFile(name: 'j_family.csv', sort: true,
-        storeDir: "${params.outdir}/sample")
-    .set { j_family_csv }
+SAMPLE_AGG_STAT(sample_csv_files, "sample_stats.csv")
+SAMPLE_AGG_V(v_family_csv_files, "v_family.csv")
+SAMPLE_AGG_D(d_family_csv_files, "d_family.csv")
+SAMPLE_AGG_J(j_family_csv_files, "j_family.csv")

Copilot AI Oct 14, 2025


Previously, collectFile used sort: true and storeDir to produce deterministic, published aggregated CSVs; replacing it with collect + SAMPLE_AGGREGATE removes both the sorting and the publication to params.outdir, potentially changing downstream reproducibility and final output availability. Reintroduce explicit sorting (e.g., sort by path or sample id before concatenation) and add publishDir/storeDir on the aggregation processes to preserve the prior behavior.
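
A minimal sketch of how that could look under the new design (channel and process names follow the PR; the toSortedList comparator and publishDir settings are assumptions):

SAMPLE_CALC.out.v_family_csv
    .toSortedList { a, b -> a.name <=> b.name }  // deterministic order regardless of task completion order
    .set { v_family_csv_files }

process SAMPLE_AGGREGATE {
    // publish the aggregated CSV where collectFile's storeDir used to place it
    publishDir "${params.outdir}/sample", mode: 'copy'
    // ...
}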


Collaborator

It's hard to tell whether a fixed order is important. Is it, @dltamayo? Also, the same comment as above about aggregating with collectFile as opposed to Python applies here.

Comment on lines +14 to +23
"""
python3 <<EOF
import pandas as pd
input_files = [${csv_files.collect { '"' + it.getName() + '"' }.join(', ')}]
dfs = [pd.read_csv(f) for f in input_files]
merged = pd.concat(dfs, axis=0, ignore_index=True)
merged.to_csv("${output_file}", index=False)
EOF
"""

Copilot AI Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aggregation does not enforce a deterministic ordering of rows (prior implementation used sort:true), so run-to-run concatenation order may vary depending on file staging order. Apply a consistent sort (e.g., sort input_files list or sort merged by key columns) before writing to ensure reproducible outputs.
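
A minimal sketch of the deterministic variant of the script block above, changing only the file list (everything else as in the PR):

"""
python3 <<EOF
import pandas as pd
# sort staged file names so concatenation does not depend on staging order
input_files = sorted([${csv_files.collect { '"' + it.getName() + '"' }.join(', ')}])
dfs = [pd.read_csv(f) for f in input_files]
merged = pd.concat(dfs, axis=0, ignore_index=True)
merged.to_csv("${output_file}", index=False)
EOF
"""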


Collaborator

Not sure whether concatenation order is important, as Copilot states above, but is there a reason we need SAMPLE_AGGREGATE as a separate Python-based process, as opposed to the seemingly simpler collectFile operator?
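
For comparison, collectFile can keep a single header row while concatenating via its keepHeader/skip options, which would cover the header concern without a separate process (channel and file names follow the PR):

SAMPLE_CALC.out.sample_csv
    .collectFile(name: 'sample_stats.csv', sort: true,
                 keepHeader: true, skip: 1,
                 storeDir: "${params.outdir}/sample")
    .set { sample_stats_csv }

One caveat: keepHeader assumes every per-sample file shares the same header, whereas pd.concat aligns differing column sets, which may be the reason for the Python-based process (see the reply about per-sample columns further down).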

-    'num_clones', 'num_TCRs', 'simpson_index', 'simpson_index_corrected', 'clonality',
-    'num_prod', 'num_nonprod', 'pct_prod', 'pct_nonprod',
-    'productive_cdr3_avg_len', 'num_convergent', 'ratio_convergent'])
+df = pd.read_csv(sample_stats_csv, sep=',')

Copilot AI Oct 14, 2025


Previously the file was read with header=None and an explicit column name list; removing that implies the CSV now contains a header row, but upstream changes do not show evidence that a header has been added. If the file still lacks a header, the first data row will be misinterpreted as column names—retain header=None with explicit names or confirm the writer now emits headers.

Suggested change
-df = pd.read_csv(sample_stats_csv, sep=',')
+# Replace the column names below with the actual column names for sample_stats_csv
+df = pd.read_csv(sample_stats_csv, sep=',', header=None, names=['col1', 'col2', 'col3', 'col4'])


dimalvovs (Collaborator) Oct 20, 2025

These headers are now present in the sample_stats.csv (output of the minimal test run), so I think this is not an issue.

sample,subject_id,timepoint,origin,num_clones,num_TCRs,simpson_index,simpson_index_corrected,clonality,num_prod,num_nonprod,pct_prod,pct_nonprod,productive_cdr3_avg_len,num_convergent,ratio_convergent

-    'TCRBV22', 'TCRBV23', 'TCRBV24', 'TCRBV25', 'TCRBV26',
-    'TCRBV27', 'TCRBV28', 'TCRBV29', 'TCRBV30'])
-v_family = v_family.sort_values(by=['patient_id', 'timepoint'])
+v_family = pd.read_csv(v_family_csv, sep=',')

Copilot AI Oct 14, 2025


The earlier version explicitly supplied the V gene column names with header=None; switching to default header handling assumes the aggregated v_family CSV now has a header row. If it does not, sorting by 'subject_id' will raise a KeyError because that column name will instead be a data value—either restore header=None with explicit names or ensure headers are written upstream.

Suggested change
-v_family = pd.read_csv(v_family_csv, sep=',')
+v_family = pd.read_csv(v_family_csv, sep=',', header=None, index_col=None,
+    names=['sample', 'file', 'subject_id', 'timepoint', 'origin', 'v_family', 'proportion'])


Collaborator

The individual stats tables contain the columns that were present for each sample, so I think this is not an issue but rather a feature: column names are not hardcoded anymore.
Example:

subject_id,timepoint,origin,TRBD1,TRBD2
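
A tiny check of the pd.concat behavior this relies on (illustrative values):

python3 <<EOF
import pandas as pd
# two samples with different gene columns, as in the example header above
a = pd.DataFrame({'subject_id': ['s1'], 'TRBD1': [0.6]})
b = pd.DataFrame({'subject_id': ['s2'], 'TRBD2': [0.4]})
merged = pd.concat([a, b], axis=0, ignore_index=True)
print(list(merged.columns))  # ['subject_id', 'TRBD1', 'TRBD2']; cells absent in a sample become NaN
EOF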
