Mark duplicates example on SAM file #123
Conversation
ddehueck commented on Jun 22, 2025
- Added an example of finding duplicate read pairs in a SAM file.
- Added an examples section to the docs, with a page for this example.
- Validated the result against Picard MarkDuplicates (https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates).

```
https://oxbow-ngs.s3.us-east-2.amazonaws.com/ALL.chrY.phase3_shapeit2_mvncall_integrated.20130502.genotypes.bcf.csi
https://oxbow-ngs.s3.us-east-2.amazonaws.com/valid.bigWig
https://oxbow-ngs.s3.us-east-2.amazonaws.com/small.bigBed
https://figshare.com/ndownloader/files/2133236
```
Not sure this is the best way to bring this sample data in: it's about 11 MB, and I found it here: https://figshare.com/articles/dataset/Example_SAM_file/1460716.
Happy to use something different if there are any recommendations.
This looks good! I might just add it to the oxbow bucket.
https://oxbow-ngs.s3.us-east-2.amazonaws.com/Col0_C1.100k.sam
Used jupytext to build the .md file from this. Not sure if this should be committed or not.
No need to commit this if the .md can be round-tripped to ipynb correctly with jupytext. I was able to do that once there was enough metadata in the YAML frontmatter.
```python
# If the bit 0x10 is set, the read is on the reverse strand
STRAND_BIT = 0x10

def get_unclipped_5_prime_start_position(row) -> int:
```
This could use a correctness check as this has all been pretty new territory for me :)
Procedurally, this looks correct to me. The only difference I see from the reference implementation is that they include hard clips ("H"), whereas this seems to only account for soft clipping ("S").
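For reference, a minimal standalone sketch of a clip-aware version that counts both soft ("S") and hard ("H") clips, similar to the reference implementation. The function name, signature, and regex parsing here are my own, not taken from the PR:

```python
import re

STRAND_BIT = 0x10  # reverse-strand flag

def unclipped_5p_start(pos: int, cigar: str, flag: int) -> int:
    """Unclipped 5' start position, counting both soft (S) and hard (H) clips."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    if flag & STRAND_BIT:
        # Reverse strand: the 5' end is the rightmost reference position,
        # extended by any trailing clips.
        ref_len = sum(int(n) for n, op in ops if op in "MDN=X")
        trailing = 0
        for n, op in reversed(ops):
            if op not in "SH":
                break
            trailing += int(n)
        return pos + ref_len - 1 + trailing
    # Forward strand: subtract any leading clips from the mapped start.
    leading = 0
    for n, op in ops:
        if op not in "SH":
            break
        leading += int(n)
    return pos - leading
```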
```python
    "\nTotal pair duplicates:",
    total_pair_dups,
)
```
Left the output as just the total count - should I add any other summary stats? Maybe mark duplicates in the original df's flag col and show the distribution of duplicates?
Yeah, I think we could "mark" or even remove duplicates by carrying the full alignments through the pipeline and then exploding / unnesting the original fields at the end.
e.g., assuming we passed the alignment records through, you could re-flatten them like this:

```python
best_pairs_df.select(
    "qname", "alignments"
).explode("alignments").select(
    pl.col("alignments").struct.unnest()
)
```
nvictus left a comment
Sorry that this took so long to get to. It's fantastic! I left some comments and suggestions. Thanks again for making the contribution!
```python
    return pos + aligned_length + trailing_soft_clips - 1

df = df.with_columns(
```
I might throw in all the required fields upfront. Strand could be represented using the conventional +/- notation people are used to.
```python
df = df.with_columns(
    pl.struct(["pos", "cigar", "flag"])
    .map_elements(get_unclipped_5_prime_start_position, return_dtype=pl.Int64)
    .alias("5p_start"),
    pl.when((pl.col("flag") & STRAND_BIT) == 0)
    .then(pl.lit("+"))
    .otherwise(pl.lit("-"))
    .alias("strand"),
    pl.col("qual").map_elements(get_quality_score_sum, return_dtype=pl.Int64)
    .alias("total_quality"),
)
```

- Quality scores
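The `get_quality_score_sum` helper referenced above isn't shown in this hunk; a minimal sketch of what it could look like, assuming Phred+33-encoded QUAL strings (standard SAM encoding):

```python
def get_quality_score_sum(qual: str) -> int:
    # SAM QUAL strings encode each Phred score as ASCII(score + 33)
    return sum(ord(c) - 33 for c in qual)
```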
````
```{code-cell} ipython3
pairs_df = df.group_by("qname").agg(
````
If we want to carry the original alignment records through in a single pass:

```python
df.group_by("qname").agg(
    [
        pl.col("rname").alias("rnames"),
        pl.col("5p_start").alias("5p_starts"),
        pl.col("strand").alias("strands"),
        pl.col("total_quality").alias("total_qualities"),
        pl.struct("*").alias("alignments"),
    ]
)
```

```python
pl.col("quals")
.map_elements(
    lambda qlist: sum(get_quality_score_sum(q) for q in qlist),
    return_dtype=pl.Int64
)
.alias("total_quality"),
```
This could be dropped if we calculate total_qualities as previously suggested.
````
```{code-cell} ipython3
# Resolve duplicate pairs by sorting by total quality and taking the best pair
best_pairs_df = pairs_df.sort("total_quality", descending=True).unique(
````
Thinking about this on a large dataset, sorting everything by total quality could be very costly. BAMs are often already coordinate-sorted, so maybe we sort first by dedup key, then by total_quality, to minimize the shuffle?
```yaml
---
jupytext:
  text_representation:
    format_name: myst
```
When I try to convert this back into ipynb with jupytext, the code cells end up just being text.
Somehow, it works correctly by just adding this extra line of metadata:
```yaml
    extension: .md
    format_name: myst
```
```python
    "\nTotal pair duplicates:",
    total_pair_dups,
)
```
It might be nice to show this chained into a full pipeline on a lazyframe at the end.
```python
ds = ox.from_sam("data/Col0_C1.100k.sam")
ldf = ds.to_polars(lazy=True).with_columns(
    pl.struct(["pos", "cigar", "flag"])
    .map_elements(get_unclipped_5_prime_start_position, return_dtype=pl.Int64)
    .alias("5p_start"),
    pl.when((pl.col("flag") & STRAND_BIT) == 0)
    .then(pl.lit("+"))
    .otherwise(pl.lit("-"))
    .alias("strand"),
    pl.col("qual").map_elements(get_quality_score_sum, return_dtype=pl.Int64)
    .alias("total_quality")
).group_by("qname").agg(
    [
        pl.col("rname").alias("rnames"),
        pl.col("5p_start").alias("5p_starts"),
        pl.col("strand").alias("strands"),
        pl.col("total_quality").alias("total_qualities"),
        pl.struct(ds.schema.names).alias("alignments")
    ]
).with_columns(
    pl.struct(["rnames", "5p_starts", "strands"])
    .map_elements(
        lambda s: build_dedup_key(s["rnames"], s["5p_starts"], s["strands"]),
        return_dtype=pl.String
    )
    .alias("dedup_key"),
).filter(
    pl.col("dedup_key").is_not_null()
).sort(
    ["dedup_key", "total_qualities"], descending=True
).unique(
    subset=["dedup_key"]
).select(
    "qname", "alignments"
).explode(
    "alignments"
).select(
    pl.col("alignments").struct.unnest()
)
ldf
```