Reformat sample_calc #60
base: main
@@ -0,0 +1,24 @@
+process SAMPLE_AGGREGATE {
+    tag "${output_file}"
+    label 'process_low'
+    container "ghcr.io/karchinlab/tcrtoolkit:main"
+
+    input:
+    path csv_files
+    val output_file
+
+    output:
+    path output_file, emit: aggregated_csv
+
+    script:
+    """
+    python3 <<EOF
+    import pandas as pd
+    input_files = [${csv_files.collect { '"' + it.getName() + '"' }.join(', ')}]
+    dfs = [pd.read_csv(f) for f in input_files]
+    merged = pd.concat(dfs, axis=0, ignore_index=True)
+    merged.to_csv("${output_file}", index=False)
+    EOF
+    """
+}
@@ -2,7 +2,16 @@ process TCRDIST3_MATRIX {
     tag "${sample_meta.sample}"
     container "ghcr.io/karchinlab/tcrtoolkit:main"

-    cpus params.max_cpus
+    cpus {
+        if (task.memory > 256.GB)
+            return 16 * task.attempt
+        else if (task.memory > 64.GB)
+            return 8 * task.attempt
+        else if (task.memory > 4.GB)
+            return 4 * task.attempt
+        else
+            return 2 * task.attempt
+    }
Comment on lines +6 to +14

Copilot review comment: The cpus directive depends on task.memory, which itself is derived later by the memory block; this circular dependency can result in task.memory being undefined or default when cpus is evaluated. Derive both cpus and memory from the same underlying metric (e.g., count_table.size()), or compute memory first in a variable and base cpus on that variable instead of task.memory.
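A minimal sketch of that restructuring, assuming `count_table` is the staged input whose size already drives the `memory` block below; the size thresholds are illustrative placeholders, not values from this PR:

```groovy
// Illustrative sketch: derive both directives from the same metric
// (input file size) so that neither depends on the other.
// The GB thresholds below are placeholders.
cpus {
    def gb = count_table.size() / (1024 * 1024 * 1024)
    def base = gb > 16 ? 16 : gb > 4 ? 8 : gb > 1 ? 4 : 2
    return base * task.attempt
}

memory {
    def gb = count_table.size() / (1024 * 1024 * 1024)
    def base = gb > 16 ? 256.GB : gb > 4 ? 64.GB : gb > 1 ? 16.GB : 4.GB
    return base * task.attempt
}
```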
     memory {
         def sz = count_table.size()
         def mb = 1024 * 1024
@@ -67,28 +67,17 @@ print('Date and time: ' + str(datetime.datetime.now()))
 # 4. Loading data

 ## reading combined repertoire statistics
-df = pd.read_csv(sample_stats_csv, sep=',', header=None,
-                 names=['sample_id', 'patient_id', 'timepoint', 'origin',
-                        'num_clones', 'num_TCRs', 'simpson_index', 'simpson_index_corrected', 'clonality',
-                        'num_prod', 'num_nonprod', 'pct_prod', 'pct_nonprod',
-                        'productive_cdr3_avg_len', 'num_convergent', 'ratio_convergent'])
+df = pd.read_csv(sample_stats_csv, sep=',')
Copilot review comment: Previously the file was read with header=None and an explicit column name list; removing that implies the CSV now contains a header row, but upstream changes do not show evidence that a header has been added. If the file still lacks a header, the first data row will be misinterpreted as column names: retain header=None with explicit names, or confirm the writer now emits headers.

Author reply: These headers are now present in the …
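If it helps to make that assumption explicit, a small defensive check could follow the read; the column names here are just a hypothetical subset of the ones the report uses later:

```python
# Fail fast if the aggregated CSV still lacks a header row; these names
# would otherwise appear as data values rather than as column labels.
expected = {'sample', 'num_clones', 'clonality', 'pct_prod'}
missing = expected - set(df.columns)
if missing:
    raise ValueError(f'sample_stats_csv appears to lack a header row; missing columns: {missing}')
```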
 # print('-- Imported sample_stats_csv as `df`...')

 ## reading sample metadata
 meta = pd.read_csv(sample_table, sep=',', header=None, index_col=None,
-                   names=['sample_id', 'file', 'patient_id', 'timepoint', 'origin'])
+                   names=['sample', 'file', 'subject_id', 'timepoint', 'origin'])
 # print('-- Imported sample_table as `meta`...')

 ## reading V gene family usage
-v_family = pd.read_csv(v_family_csv, sep=',', header=None, index_col=None,
-                       names=['patient_id', 'timepoint', 'origin', 'TCRBV01',
-                              'TCRBV02', 'TCRBV03', 'TCRBV04', 'TCRBV05', 'TCRBV06',
-                              'TCRBV07', 'TCRBV08', 'TCRBV09', 'TCRBV10', 'TCRBV11',
-                              'TCRBV12', 'TCRBV13', 'TCRBV14', 'TCRBV15', 'TCRBV16',
-                              'TCRBV17', 'TCRBV18', 'TCRBV19', 'TCRBV20', 'TCRBV21',
-                              'TCRBV22', 'TCRBV23', 'TCRBV24', 'TCRBV25', 'TCRBV26',
-                              'TCRBV27', 'TCRBV28', 'TCRBV29', 'TCRBV30'])
-v_family = v_family.sort_values(by=['patient_id', 'timepoint'])
+v_family = pd.read_csv(v_family_csv, sep=',')
Copilot review comment: The earlier version explicitly supplied the V gene column names with header=None; switching to default header handling assumes the aggregated v_family CSV now has a header row. If it does not, sorting by 'subject_id' will raise a KeyError because that column name will instead be a data value: either restore header=None with explicit names, or ensure headers are written upstream.

Author reply: The individual stats tables contain the columns that were present for each sample, so I think this is not an issue but rather a feature: column names are not hardcoded anymore.
+v_family = v_family.sort_values(by=['subject_id', 'timepoint'])
 ```

 # Sample level statistics {#sec-sample-level-stats}
@@ -106,7 +95,7 @@ fig = px.box(df,
              x = 'timepoint',
              y='num_clones',
              color='origin',
-             points='all', hover_data=['sample_id'],
+             points='all', hover_data=['sample'],
              category_orders={'timepoint': timepts})
 fig.show()
 ```
@@ -120,7 +109,7 @@ fig = px.box(df,
              x = 'timepoint',
              y='clonality',
              color='origin',
-             points='all', hover_data=['sample_id'],
+             points='all', hover_data=['sample'],
              category_orders={'timepoint': timepts})
 fig.show()
 ```
@@ -134,7 +123,7 @@ fig = px.box(df,
              x = 'timepoint',
              y='simpson_index_corrected',
              color='origin',
-             points='all', hover_data=['sample_id'],
+             points='all', hover_data=['sample'],
              category_orders={'timepoint': timepts})
 fig.show()
 ```
@@ -152,7 +141,7 @@ fig = px.box(df,
              x = 'timepoint',
              y='pct_prod',
              color='origin',
-             points='all', hover_data=['sample_id'],
+             points='all', hover_data=['sample'],
              category_orders={'timepoint': timepts})
 fig.show()
 ```
@@ -170,7 +159,7 @@ fig = px.box(df,
              x = 'timepoint',
              y='productive_cdr3_avg_len',
              color='origin',
-             points='all', hover_data=['sample_id'],
+             points='all', hover_data=['sample'],
              category_orders={'timepoint': timepts})
 fig.show()
 ```
@@ -184,7 +173,7 @@ fig = px.box(df,
              x = 'timepoint',
              y='ratio_convergent',
              color='origin',
-             points='all', hover_data=['sample_id'],
+             points='all', hover_data=['sample'],
              category_orders={'timepoint': timepts})
 fig.show()
 ```
@@ -210,16 +199,16 @@ where $N_{k}$ is the number of TCRs that use the $k$ th V gene, and T is the tot
 colors = ["#fafa70","#fdef6b","#ffe566","#ffda63","#ffd061","#ffc660","#ffbb5f","#fdb15f","#fba860","#f79e61","#f39562","#ef8c63","#e98365","#e37b66","#dd7367","#d66b68","#ce6469","#c65e6a","#bd576b","#b4526b","#ab4c6b","#a1476a","#974369","#8c3e68","#823a66","#773764","#6d3361","#62305e","#572c5a","#4d2956"]

 ## calculate proportions and add to v_family_long
-v_family_long = pd.melt(v_family, id_vars=['patient_id', 'timepoint', 'origin'], value_vars=v_family.columns[3:], var_name='v_gene', value_name='count')
-v_family_long['proportion'] = v_family_long.groupby(['patient_id', 'timepoint', 'origin'])['count'].transform(lambda x: x / x.sum())
+v_family_long = pd.melt(v_family, id_vars=['subject_id', 'timepoint', 'origin'], value_vars=v_family.columns[3:], var_name='v_gene', value_name='count')
+v_family_long['proportion'] = v_family_long.groupby(['subject_id', 'timepoint', 'origin'])['count'].transform(lambda x: x / x.sum())

 ## add in the total number of v genes for each sample
-total_v_genes = v_family_long.groupby(['patient_id', 'timepoint', 'origin'])['count'].sum().reset_index()
-total_v_genes.columns = ['patient_id', 'timepoint', 'origin', 'total_v_genes']
-v_family_long = pd.merge(v_family_long, total_v_genes, on=['patient_id', 'timepoint', 'origin'])
+total_v_genes = v_family_long.groupby(['subject_id', 'timepoint', 'origin'])['count'].sum().reset_index()
+total_v_genes.columns = ['subject_id', 'timepoint', 'origin', 'total_v_genes']
+v_family_long = pd.merge(v_family_long, total_v_genes, on=['subject_id', 'timepoint', 'origin'])

-for patient in v_family_long.patient_id.unique().tolist():
-    current = v_family_long[v_family_long.patient_id == patient]
+for patient in v_family_long.subject_id.unique().tolist():
+    current = v_family_long[v_family_long.subject_id == patient]
     fig = go.Figure()
     fig.update_layout(
         template="simple_white",
Copilot review comment: Aggregation does not enforce a deterministic ordering of rows (the prior implementation used sort: true), so run-to-run concatenation order may vary with file staging order. Apply a consistent sort (e.g., sort the input_files list, or sort merged by key columns) before writing, to ensure reproducible outputs.
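A minimal way to do that in the SAMPLE_AGGREGATE script block above would be to sort the staged file names before reading; only the `sorted(...)` call is new:

```groovy
script:
"""
python3 <<EOF
import pandas as pd
# Sort staged file names so concatenation order is independent of staging order
input_files = sorted([${csv_files.collect { '"' + it.getName() + '"' }.join(', ')}])
dfs = [pd.read_csv(f) for f in input_files]
merged = pd.concat(dfs, axis=0, ignore_index=True)
merged.to_csv("${output_file}", index=False)
EOF
"""
```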
Reviewer reply: Not sure whether concatenation order matters as Copilot states above, but is there a reason we need SAMPLE_AGGREGATE as a separate Python-based process, as opposed to the seemingly simpler collectFile operator?
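For comparison, a rough sketch of the collectFile alternative at the workflow level; the channel name `sample_stats_ch` and output file name are hypothetical, and `keepHeader` assumes every per-sample CSV carries the same header row:

```groovy
// Hypothetical replacement for the SAMPLE_AGGREGATE process: concatenate the
// per-sample CSVs with collectFile, keeping the header from the first file,
// skipping it in the rest, and sorting entries for deterministic output.
sample_stats_ch
    .collectFile(name: 'sample_stats.csv', keepHeader: true, skip: 1, sort: true)
    .set { aggregated_csv_ch }
```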