# Mark duplicates example on SAM file #123
@@ -0,0 +1 @@
../../fixtures
@@ -0,0 +1,13 @@
# Examples

```{attention}
This section is still under construction.
```

```{toctree}
:maxdepth: 2

sam_mark_duplicates
```
@@ -0,0 +1,186 @@
---
jupytext:
  text_representation:
    format_name: myst
|
**Member:** When I try to convert this back into ipynb with jupytext, the code cells end up just being text. Somehow, it works correctly by just adding this extra line of metadata:

Suggested change: *(the suggested metadata line is not shown in the diff view)*
kernelspec:
  display_name: Python 3
  name: python3
---

# Marking SAM read pair duplicates with oxbow

```{code-cell} ipython3
import oxbow as ox
import polars as pl
```

## Loading the SAM file

Let's grab a sample SAM file.

```{code-cell} ipython3
# Use oxbow to read the SAM file into a polars DataFrame
df = ox.from_sam("data/2133236").to_polars()
df.head()
```

## Compute 5' start position

```{code-cell} ipython3
def parse_cigar(cigar_str: str) -> list[tuple[str, int]]:
    """Parse a CIGAR string into a list of (operation, length) tuples.

    Example:
        >>> parse_cigar("76M")
        [('M', 76)]
        >>> parse_cigar("10M1I65M")
        [('M', 10), ('I', 1), ('M', 65)]
    """
    result = []
    current_number = ""
    for char in cigar_str:
        if char.isdigit():
            current_number += char
        else:
            if current_number:
                result.append((char, int(current_number)))
                current_number = ""
    return result


# If bit 0x10 is set, the read is on the reverse strand
STRAND_BIT = 0x10


def get_unclipped_5_prime_start_position(row) -> int:
|
**Contributor (author):** This could use a correctness check as this has all been pretty new territory for me :)

**Member:** Procedurally, this looks correct to me. The only difference I see from the reference implementation is that they include hard clips ("H"), whereas this seems to only account for soft clipping ("S").

    """
    Get the unclipped 5' start position from the CIGAR string.

    Args:
        row: Row from the SAM file
    Returns:
        int: Unclipped 5' start position
    """
    pos = row["pos"]
    cigar = row["cigar"]
    flag = row["flag"]

    # Parse the CIGAR string
    cigar_ops = parse_cigar(cigar)

    # Check if reverse strand (bit 0x10 set)
    is_reverse = flag & STRAND_BIT

    if not is_reverse:
        # Forward strand: 5' end = POS - (number of leading soft-clipped bases)
        leading_soft_clips = 0
        for op, length in cigar_ops:
            if op == "S":
                leading_soft_clips += length
            else:
                break  # Stop at the first non-S operation
        return pos - leading_soft_clips
    else:
        # Reverse strand: 5' end = POS + (aligned length) + (trailing soft-clipped bases) - 1
        aligned_length = 0
        trailing_soft_clips = 0
        # Aligned length counts reference-consuming operations (M, =, X, D, N)
        for op, length in cigar_ops:
            if op in ["M", "=", "X", "D", "N"]:
                aligned_length += length
        # Find trailing soft clips (S operations at the end)
        for op, length in reversed(cigar_ops):
            if op == "S":
                trailing_soft_clips += length
            else:
                break  # Stop at the first non-S operation from the end
        return pos + aligned_length + trailing_soft_clips - 1


df = df.with_columns(
|
**Member:** I might throw in all the required fields upfront. Strand could be represented using the conventional +/- notation people are used to.

```python
df = df.with_columns(
    pl.struct(["pos", "cigar", "flag"])
    .map_elements(get_unclipped_5_prime_start_position, return_dtype=pl.Int64)
    .alias("5p_start"),
    pl.when((pl.col("flag") & STRAND_BIT) == 0)
    .then(pl.lit("+"))
    .otherwise(pl.lit("-"))
    .alias("strand"),
    pl.col("qual").map_elements(get_quality_score_sum, return_dtype=pl.Int64)
    .alias("total_quality")
)
```
    pl.struct(["pos", "cigar", "flag"])
    .map_elements(get_unclipped_5_prime_start_position)
    .alias("unclipped_5p_start_pos")
)
df.head()
```

## Group the reads into pairs

We assume that each qname corresponds to a read pair. We'll also include the information needed to deduplicate read pairs:
- Reference sequence name
- 5' start positions
- Strand flags
- Quality scores

```{code-cell} ipython3
pairs_df = df.group_by("qname").agg(
|
**Member:** If we want to carry the original alignment records through in a single pass:

```python
df.group_by("qname").agg(
    [
        pl.col("rname").alias("rnames"),
        pl.col("5p_start").alias("5p_starts"),
        pl.col("strand").alias("strands"),
        pl.col("total_quality").alias("total_qualities"),
        pl.struct("*").alias("alignments")
    ]
)
```
    [
        pl.col("rname").alias("rnames"),
        pl.col("unclipped_5p_start_pos").alias("5p_positions"),
        ((pl.col("flag") & STRAND_BIT) != 0).alias("strands"),
        pl.col("qual").alias("quals"),
    ]
)
pairs_df.head()
```

## Process pair reads

Build a deduplication key and sum the quality scores for each pair (for resolving duplicates in the next step).

```{code-cell} ipython3
def get_quality_score_sum(qual_str):
    """Sum the Phred quality scores in a Phred+33-encoded QUAL string"""
    return sum(ord(c) - 33 for c in qual_str if c != " ")


def build_dedup_key(rnames, positions, strands):
    """Make a dedup key for a read pair"""
    items = sorted(zip(rnames, positions, strands))
    if len(items) < 2:
        print(f"WARNING: read is missing pair: {items}")
        return None
    return f"{items[0][0]}:{items[0][1]}:{items[0][2]}__{items[1][0]}:{items[1][1]}:{items[1][2]}"


pairs_df = pairs_df.with_columns(
    pl.struct(["rnames", "5p_positions", "strands"])
    .map_elements(
        lambda s: build_dedup_key(s["rnames"], s["5p_positions"], s["strands"]),
        return_dtype=pl.String
    )
    .alias("dedup_key"),
    pl.col("quals")
    .map_elements(
        lambda qlist: sum(get_quality_score_sum(q) for q in qlist),
        return_dtype=pl.Int64
    )
    .alias("total_quality"),
|
**Member** (on lines +157 to +162): This could be dropped if we calculate `total_quality` upfront, as suggested above.
).filter(pl.col("dedup_key").is_not_null())

pairs_df[["dedup_key", "total_quality"]].head()
```
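
As a quick sanity check of these helpers (the inputs here are made up; the expected values follow from the Phred+33 rule above):

```python
# ord("I") = 73, so each "I" contributes 73 - 33 = 40
assert get_quality_score_sum("II") == 80

# Mates are sorted, so the key doesn't depend on mate order
print(build_dedup_key(["chr1", "chr1"], [250, 100], [True, False]))
# chr1:100:False__chr1:250:True
```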

## Count the duplicates

We choose the best read pair among duplicates by the highest total quality score.

```{code-cell} ipython3
# Resolve duplicate pairs by sorting by total quality and taking the best pair
best_pairs_df = pairs_df.sort("total_quality", descending=True).unique(
|
**Member:** Thinking about this on a large dataset, sorting everything by total quality could be very costly. Often BAMs are coordinate-sorted already, so maybe we sort first by dedup key, then by total_quality, to minimize the shuffle?
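
A minimal sketch of that ordering (assuming `keep="first"` together with `maintain_order=True` retains the highest-quality pair within each key):

```python
best_pairs_df = (
    pairs_df
    .sort(["dedup_key", "total_quality"], descending=[False, True])
    .unique(subset=["dedup_key"], keep="first", maintain_order=True)
)
```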
    subset=["dedup_key"]
)

# Get the total number of duplicate pairs
total_pair_dups = pairs_df.height - best_pairs_df.height
print(
    best_pairs_df.head(),
    "\nTotal pair duplicates:",
    total_pair_dups,
)
```
|
**Contributor (author):** Left the output as just the total count - should I add any other summary stats? Maybe mark duplicates in the original df's flag col and show the distribution of duplicates?

**Member:** Yeah, I think we could "mark" or even remove duplicates by carrying the full alignments through the pipeline and then exploding / unnesting the original fields at the end.

**Member:** e.g., assuming we passed the alignment records through, you could re-flatten them like this:

```python
best_pairs_df.select(
    "qname", "alignments"
).explode("alignments").select(
    pl.col("alignments").struct.unnest()
)
```
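
For the distribution question raised above, a minimal sketch (assuming `pairs_df` still holds all pairs, i.e. before `.unique()`):

```python
# How many pairs share each dedup key, and how common each multiplicity is
dup_counts = pairs_df.group_by("dedup_key").len()
print(dup_counts["len"].value_counts().sort("len"))
```

And a hedged sketch of the "mark" variant (names are hypothetical; 0x400 is the standard SAM duplicate flag bit, and this assumes polars' `|` performs bitwise OR on integer columns, as `&` does above):

```python
DUPLICATE_BIT = 0x400  # SAM flag bit for "PCR or optical duplicate"

best_qnames = best_pairs_df.get_column("qname")
marked_df = df.with_columns(
    pl.when(pl.col("qname").is_in(best_qnames))
    .then(pl.col("flag"))                        # best pair: leave flag as-is
    .otherwise(pl.col("flag") | DUPLICATE_BIT)   # duplicates: set the 0x400 bit
    .alias("flag")
)
# note: reads filtered out earlier (missing mates) would also be marked here
```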

**Member:** It might be nice to show this chained into a full pipeline on a lazyframe at the end.

```python
ds = ox.from_sam("data/Col0_C1.100k.sam")
ldf = ds.to_polars(lazy=True).with_columns(
    pl.struct(["pos", "cigar", "flag"])
    .map_elements(get_unclipped_5_prime_start_position, return_dtype=pl.Int64)
    .alias("5p_start"),
    pl.when((pl.col("flag") & STRAND_BIT) == 0)
    .then(pl.lit("+"))
    .otherwise(pl.lit("-"))
    .alias("strand"),
    pl.col("qual").map_elements(get_quality_score_sum, return_dtype=pl.Int64)
    .alias("total_quality")
).group_by("qname").agg(
    [
        pl.col("rname").alias("rnames"),
        pl.col("5p_start").alias("5p_starts"),
        pl.col("strand").alias("strands"),
        pl.col("total_quality").alias("total_qualities"),
        pl.struct(ds.schema.names).alias("alignments")
    ]
).with_columns(
    pl.struct(["rnames", "5p_starts", "strands"])
    .map_elements(
        lambda s: build_dedup_key(s["rnames"], s["5p_starts"], s["strands"]),
        return_dtype=pl.String
    )
    .alias("dedup_key"),
).filter(
    pl.col("dedup_key").is_not_null()
).sort(
    ["dedup_key", "total_qualities"], descending=True
).unique(
    subset=["dedup_key"]
).select(
    "qname", "alignments"
).explode(
    "alignments"
).select(
    pl.col("alignments").struct.unnest()
)
ldf
```
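
One usage note on the lazy version: nothing executes until the plan is collected, e.g.:

```python
dedup_df = ldf.collect()  # materialize the deduplicated alignments
```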

@@ -6,3 +6,4 @@
https://oxbow-ngs.s3.us-east-2.amazonaws.com/ALL.chrY.phase3_shapeit2_mvncall_integrated.20130502.genotypes.bcf.csi
https://oxbow-ngs.s3.us-east-2.amazonaws.com/valid.bigWig
https://oxbow-ngs.s3.us-east-2.amazonaws.com/small.bigBed
https://figshare.com/ndownloader/files/2133236

**Contributor (author):** Not sure this is the best way to bring this sample data in - it's about 11MB and found it here: https://figshare.com/articles/dataset/Example_SAM_file/1460716. Happy to use something different if there are any recommendations.

**Member:** This looks good! I might just add it to the oxbow bucket.