Mark duplicates example on SAM file #123
Conversation
ddehueck commented on Jun 22, 2025
- Added an example of finding duplicate read pairs in a SAM file.
- Added an examples section to the docs, with a page for this example.
- Validated the result against Picard MarkDuplicates (https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates).

```
https://oxbow-ngs.s3.us-east-2.amazonaws.com/ALL.chrY.phase3_shapeit2_mvncall_integrated.20130502.genotypes.bcf.csi
https://oxbow-ngs.s3.us-east-2.amazonaws.com/valid.bigWig
https://oxbow-ngs.s3.us-east-2.amazonaws.com/small.bigBed
https://figshare.com/ndownloader/files/2133236
```
Not sure this is the best way to bring this sample data in: it's about 11 MB, and I found it here: https://figshare.com/articles/dataset/Example_SAM_file/1460716.
Happy to use something different if there are any recommendations.
This looks good! I might just add it to the oxbow bucket.
https://oxbow-ngs.s3.us-east-2.amazonaws.com/Col0_C1.100k.sam
Used jupytext to build the .md file from this. Not sure if this should be committed or not.
No need to commit this if the .md can be round-tripped to ipynb correctly with jupytext. I was able to do that once there was enough metadata in the YAML frontmatter.
```python
# If the bit 0x10 is set, the read is on the reverse strand
STRAND_BIT = 0x10

def get_unclipped_5_prime_start_position(row) -> int:
```
This could use a correctness check as this has all been pretty new territory for me :)
Procedurally, this looks correct to me. The only difference I see from the reference implementation is that they include hard clips ("H"), whereas this seems to only account for soft clipping ("S").
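For reference, a minimal standalone sketch of a clip-aware version that counts both soft ("S") and hard ("H") clips, similar to the reference implementation. The function name, signature, and regex parsing here are my own, not taken from the PR:

```python
import re

STRAND_BIT = 0x10  # reverse-strand flag

def unclipped_5p_start(pos: int, cigar: str, flag: int) -> int:
    """Unclipped 5' start position, counting both soft (S) and hard (H) clips."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    if flag & STRAND_BIT:
        # Reverse strand: the 5' end is the rightmost reference position,
        # extended by any trailing clips.
        ref_len = sum(int(n) for n, op in ops if op in "MDN=X")
        trailing = 0
        for n, op in reversed(ops):
            if op not in "SH":
                break
            trailing += int(n)
        return pos + ref_len - 1 + trailing
    # Forward strand: subtract any leading clips from the mapped start.
    leading = 0
    for n, op in ops:
        if op not in "SH":
            break
        leading += int(n)
    return pos - leading
```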
```python
    "\nTotal pair duplicates:",
    total_pair_dups,
)
```
Left the output as just the total count - should I add any other summary stats? Maybe mark duplicates in the original df's flag col and show the distribution of duplicates?
Yeah, I think we could "mark" or even remove duplicates by carrying the full alignments through the pipeline and then exploding / unnesting the original fields at the end.
e.g., assuming we passed the alignment records through, you could re-flatten them like this:

```python
best_pairs_df.select(
    "qname", "alignments"
).explode("alignments").select(
    pl.col("alignments").struct.unnest()
)
```
nvictus left a comment
Sorry that this took so long to get to. It's fantastic! I left some comments and suggestions. Thanks again for making the contribution!
```python
    return pos + aligned_length + trailing_soft_clips - 1

df = df.with_columns(
```
I might throw in all the required fields upfront. Strand could be represented using the conventional +/- notation people are used to.
```python
df = df.with_columns(
    pl.struct(["pos", "cigar", "flag"])
    .map_elements(get_unclipped_5_prime_start_position, return_dtype=pl.Int64)
    .alias("5p_start"),
    pl.when((pl.col("flag") & STRAND_BIT) == 0)
    .then(pl.lit("+"))
    .otherwise(pl.lit("-"))
    .alias("strand"),
    pl.col("qual").map_elements(get_quality_score_sum, return_dtype=pl.Int64)
    .alias("total_quality"),
)
```

- Quality scores
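The `get_quality_score_sum` helper referenced above isn't shown in this hunk; a minimal sketch of what it could look like, assuming Phred+33-encoded QUAL strings (standard SAM encoding):

```python
def get_quality_score_sum(qual: str) -> int:
    # SAM QUAL strings encode each Phred score as ASCII(score + 33)
    return sum(ord(c) - 33 for c in qual)
```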
````
```{code-cell} ipython3
pairs_df = df.group_by("qname").agg(
````
If we want to carry the original alignment records through in a single pass:

```python
df.group_by("qname").agg(
    [
        pl.col("rname").alias("rnames"),
        pl.col("5p_start").alias("5p_starts"),
        pl.col("strand").alias("strands"),
        pl.col("total_quality").alias("total_qualities"),
        pl.struct("*").alias("alignments"),
    ]
)
```

```python
pl.col("quals")
.map_elements(
    lambda qlist: sum(get_quality_score_sum(q) for q in qlist),
    return_dtype=pl.Int64
)
.alias("total_quality"),
```
This could be dropped if we calculate total_qualities as previously suggested.
````
```{code-cell} ipython3
# Resolve duplicate pairs by sorting by total quality and taking the best pair
best_pairs_df = pairs_df.sort("total_quality", descending=True).unique(
````
Thinking about this on a large dataset, sorting everything by total quality could be very costly. BAMs are often already coordinate-sorted, so maybe we sort first by dedup key, then by total_quality, to minimize the shuffle?
```yaml
---
jupytext:
  text_representation:
    format_name: myst
```
When I try to convert this back into ipynb with jupytext, the code cells end up just being text.
Somehow, it works correctly by just adding this extra line of metadata:
```yaml
    extension: .md
    format_name: myst
```
```python
    "\nTotal pair duplicates:",
    total_pair_dups,
)
```
It might be nice to show this chained into a full pipeline on a lazyframe at the end.
```python
ds = ox.from_sam("data/Col0_C1.100k.sam")
ldf = ds.to_polars(lazy=True).with_columns(
    pl.struct(["pos", "cigar", "flag"])
    .map_elements(get_unclipped_5_prime_start_position, return_dtype=pl.Int64)
    .alias("5p_start"),
    pl.when((pl.col("flag") & STRAND_BIT) == 0)
    .then(pl.lit("+"))
    .otherwise(pl.lit("-"))
    .alias("strand"),
    pl.col("qual").map_elements(get_quality_score_sum, return_dtype=pl.Int64)
    .alias("total_quality")
).group_by("qname").agg(
    [
        pl.col("rname").alias("rnames"),
        pl.col("5p_start").alias("5p_starts"),
        pl.col("strand").alias("strands"),
        pl.col("total_quality").alias("total_qualities"),
        pl.struct(ds.schema.names).alias("alignments")
    ]
).with_columns(
    pl.struct(["rnames", "5p_starts", "strands"])
    .map_elements(
        lambda s: build_dedup_key(s["rnames"], s["5p_starts"], s["strands"]),
        return_dtype=pl.String
    )
    .alias("dedup_key"),
).filter(
    pl.col("dedup_key").is_not_null()
).sort(
    ["dedup_key", "total_qualities"], descending=True
).unique(
    subset=["dedup_key"]
).select(
    "qname", "alignments"
).explode(
    "alignments"
).select(
    pl.col("alignments").struct.unnest()
)
ldf
```