Conversation

@ddehueck
Contributor

/Users/Devin/Documents/github-repositories/devins-oxbow-fork/oxbow/docs/_build/html/examples/sam_mark_duplicates.html

https://oxbow-ngs.s3.us-east-2.amazonaws.com/ALL.chrY.phase3_shapeit2_mvncall_integrated.20130502.genotypes.bcf.csi
https://oxbow-ngs.s3.us-east-2.amazonaws.com/valid.bigWig
https://oxbow-ngs.s3.us-east-2.amazonaws.com/small.bigBed
https://figshare.com/ndownloader/files/2133236
Contributor Author

Not sure this is the best way to bring this sample data in; it's about 11 MB and I found it here: https://figshare.com/articles/dataset/Example_SAM_file/1460716

Happy to use something different if there are any recommendations.

Member

This looks good! I might just add it to the oxbow bucket.

Member

https://oxbow-ngs.s3.us-east-2.amazonaws.com/Col0_C1.100k.sam

Contributor Author

I used jupytext to build the .md file from this. Not sure whether it should be committed.

Member

@nvictus · Nov 5, 2025

No need to commit this if the .md can be roundtripped to ipynb correctly with jupytext, which I was able to do with enough metadata in the YAML frontmatter.
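
For reference, a minimal roundtrip check via jupytext's Python API might look like this (a sketch; the filenames are assumed from the built docs path above):

```python
import jupytext

# Read the MyST markdown file as a notebook, then write it back out as .ipynb
nb = jupytext.read("sam_mark_duplicates.md")
jupytext.write(nb, "sam_mark_duplicates.ipynb")
```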

# If the bit 0x10 is set, the read is on the reverse strand
STRAND_BIT = 0x10
def get_unclipped_5_prime_start_position(row) -> int:
Contributor Author

This could use a correctness check, as this has all been pretty new territory for me :)

Member

Procedurally, this looks correct to me. The only difference I see from the reference implementation is that they include hard clips ("H"), whereas this seems to only account for soft clipping ("S").
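
For illustration, a minimal sketch of a variant that counts both clip types (the regex helper and row fields are assumptions based on the snippets in this PR, not the actual example code):

```python
import re

# Sketch only: count hard clips ("H") as well as soft clips ("S"),
# as the reference implementation does. CIGAR_OP and the row fields
# ("pos", "cigar", "flag") are assumed from the snippets above.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
STRAND_BIT = 0x10

def get_unclipped_5_prime_start_position(row) -> int:
    ops = CIGAR_OP.findall(row["cigar"])
    if (row["flag"] & STRAND_BIT) == 0:
        # Forward strand: subtract any leading clips from pos
        clipped = 0
        for length, op in ops:
            if op in ("S", "H"):
                clipped += int(length)
            else:
                break
        return row["pos"] - clipped
    else:
        # Reverse strand: rightmost aligned base plus any trailing clips.
        # Reference-consuming ops: M, D, N, =, X
        aligned = sum(int(n) for n, op in ops if op in ("M", "D", "N", "=", "X"))
        clipped = 0
        for length, op in reversed(ops):
            if op in ("S", "H"):
                clipped += int(length)
            else:
                break
        return row["pos"] + aligned + clipped - 1
```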

"\nTotal pair duplicates:",
total_pair_dups,
)
```
Contributor Author

I left the output as just the total count. Should I add any other summary stats? Maybe mark duplicates in the original df's flag column and show the distribution of duplicates?

Member

Yeah, I think we could "mark" or even remove duplicates by carrying the full alignments through the pipeline and then exploding / unnesting the original fields at the end.

Member

e.g., assuming we passed the alignments records through, you could re-flatten them like this:

```python
best_pairs_df.select(
    "qname", "alignments"
).explode("alignments").select(
    pl.col("alignments").struct.unnest()
)
```

@ddehueck marked this pull request as ready for review on June 22, 2025 at 23:58
Member

@nvictus left a comment

Sorry that this took so long to get to. It's fantastic! I left some comments and suggestions. Thanks again for making the contribution!

return pos + aligned_length + trailing_soft_clips - 1
df = df.with_columns(
Member

I might throw in all the required fields upfront. Strand could be represented using the conventional +/- notation people are used to.

```python
df = df.with_columns(
    pl.struct(["pos", "cigar", "flag"])
    .map_elements(get_unclipped_5_prime_start_position, return_dtype=pl.Int64)
    .alias("5p_start"),

    pl.when((pl.col("flag") & STRAND_BIT) == 0)
    .then(pl.lit("+"))
    .otherwise(pl.lit("-"))
    .alias("strand"),

    pl.col("qual").map_elements(get_quality_score_sum, return_dtype=pl.Int64)
    .alias("total_quality")
)
```

- Quality scores

```{code-cell} ipython3
pairs_df = df.group_by("qname").agg(
Member

If we want to carry the original alignment records through a single pass:

df.group_by("qname").agg(
    [
        pl.col("rname").alias("rnames"),
        pl.col("5p_start").alias("5p_starts"),
        pl.col("strand").alias("strands"),
        pl.col("total_quality").alias("total_qualities"),
        pl.struct("*").alias("alignments")
    ]
)

Comment on lines +157 to +162
pl.col("quals")
.map_elements(
lambda qlist: sum(get_quality_score_sum(q) for q in qlist),
return_dtype=pl.Int64
)
.alias("total_quality"),
Member

This could be dropped if we calculate total_qualities as previously suggested.
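
Concretely, a sketch of that simplification (assuming the per-alignment total_quality column from the earlier suggestion already exists before the group_by):

```python
# Sketch only: with per-read "total_quality" computed upfront, the pair
# total becomes a native sum in the aggregation, replacing the Python
# lambda over the list of qual strings.
pairs_df = df.group_by("qname").agg(
    pl.col("total_quality").sum().alias("total_quality"),
)
```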


```{code-cell} ipython3
# Resolve duplicate pairs by sorting by total quality and taking the best pair
best_pairs_df = pairs_df.sort("total_quality", descending=True).unique(
Member

Thinking about this on a large dataset, sorting everything by total quality could be very costly. BAMs are often already coordinate-sorted, so maybe we sort first by dedup key, then by total_quality, to minimize the shuffle?
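
A sketch of that ordering (column names assumed from the surrounding snippets):

```python
# Sketch only: sort by the dedup key first, then by total quality
# (descending) within each key, and keep the first (best) pair per key.
best_pairs_df = (
    pairs_df
    .sort(["dedup_key", "total_quality"], descending=[False, True])
    .unique(subset=["dedup_key"], keep="first")
)
```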

"\nTotal pair duplicates:",
total_pair_dups,
)
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g., assuming we passed the alignments records through, you could re-flatten them like this:

best_pairs_df.select(
    "qname", "alignments"
).explode("alignments").select(
    pl.col("alignments").struct.unnest()
)

https://oxbow-ngs.s3.us-east-2.amazonaws.com/ALL.chrY.phase3_shapeit2_mvncall_integrated.20130502.genotypes.bcf.csi
https://oxbow-ngs.s3.us-east-2.amazonaws.com/valid.bigWig
https://oxbow-ngs.s3.us-east-2.amazonaws.com/small.bigBed
https://figshare.com/ndownloader/files/2133236
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! I might just add it to the oxbow bucket.

---
jupytext:
  text_representation:
    format_name: myst
Member

When I try to convert this back into ipynb with jupytext, the code cells end up just being text.

Somehow, it works correctly by just adding this extra line of metadata:

Suggested change:

```diff
-    format_name: myst
+    extension: .md
+    format_name: myst
```

"\nTotal pair duplicates:",
total_pair_dups,
)
```
Member

It might be nice to show this chained into a full pipeline on a lazyframe at the end.

```python
ds = ox.from_sam("data/Col0_C1.100k.sam")

ldf = ds.to_polars(lazy=True).with_columns(
    pl.struct(["pos", "cigar", "flag"])
    .map_elements(get_unclipped_5_prime_start_position, return_dtype=pl.Int64)
    .alias("5p_start"),

    pl.when((pl.col("flag") & STRAND_BIT) == 0)
    .then(pl.lit("+"))
    .otherwise(pl.lit("-"))
    .alias("strand"),

    pl.col("qual").map_elements(get_quality_score_sum, return_dtype=pl.Int64)
    .alias("total_quality")
).group_by("qname").agg(
    [
        pl.col("rname").alias("rnames"),
        pl.col("5p_start").alias("5p_starts"),
        pl.col("strand").alias("strands"),
        pl.col("total_quality").alias("total_qualities"),
        pl.struct(ds.schema.names).alias("alignments")
    ]
).with_columns(
    pl.struct(["rnames", "5p_starts", "strands"])
    .map_elements(
        lambda s: build_dedup_key(s["rnames"], s["5p_starts"], s["strands"]),
        return_dtype=pl.String
    )
    .alias("dedup_key"),
).filter(
    pl.col("dedup_key").is_not_null()
).sort(
    ["dedup_key", "total_qualities"], descending=True
).unique(
    subset=["dedup_key"]
).select(
    "qname", "alignments"
).explode(
    "alignments"
).select(
    pl.col("alignments").struct.unnest()
)

ldf
```
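
Materializing the result would then just be a `collect()` on the lazyframe:

```python
# Executes the whole deduplication pipeline in one pass
result = ldf.collect()
```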

