
Commit df0e2e5

Updated with new features!
1 parent 729a8b6 commit df0e2e5

1 file changed: +16 −15 lines changed

README.md

Lines changed: 16 additions & 15 deletions

````diff
@@ -9,17 +9,19 @@ Originally designed to count patterns in DNA sequences with ambiguous bases as d
 
 ## Usage
 
-> [!IMPORTANT]
-> While the query sequences can all be different lengths, currently `Pattern Buffer` only supports counting these patterns on sequences that are all the same length.
-
 `Pattern Buffer` can be used with a broadly functional or object-oriented (OO) interface, with the functional interface geared towards one-time use and the OO interface for repeated use (e.g. file parsing or PyTorch DataLoaders).
 
 To demonstrate, we'll first create some sample sequences and queries. As we're using IUPAC nucleotide sequences, we can use the provided `generate_iupac_embedding` function to create the embedding tensor.
 
 ```python
 from pattern_buffer import generate_iupac_embedding
-sequences = ["AACGAATCAAAAT", "AACAGTTCAAAAT", "AACAGTTCGYGGA", "AACAAGATCAGGA"]
-queries = ["AAA", "AGT", "AAACA", "AAR", "GYGGA"]
+sequences = [
+    "AACGAATCAAAAT",
+    "AACAGTTCAAAAATTAGT",
+    "AGTTCGYGGA",
+    "AACAAGATCAGGAAAGCTGACTTGATG",
+]
+queries = ["AAA", "AGT", "AAACA", "AAR", "GYGGA"]
 embedding = generate_iupac_embedding()
 ```
 
````
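
The example queries above use the IUPAC degenerate bases R and Y. As background only (standard IUPAC nomenclature, not code taken from `pattern_buffer`), the sketch below shows which concrete bases those codes stand for and why a query such as `AAR` can count occurrences of `AAA` or `AAG`; the library's actual matching behaviour is defined by its embedding and `support` tensor.

```python
# Background only: standard IUPAC nucleotide ambiguity codes used in the
# example queries. This is NOT how `pattern_buffer` represents its embedding;
# it just illustrates which concrete bases each degenerate code stands for.
IUPAC_CODES = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"},  # puRine
    "Y": {"C", "T"},  # pYrimidine
}

def compatible(query_base: str, sequence_base: str) -> bool:
    """Two codes are compatible if they share at least one concrete base."""
    return bool(IUPAC_CODES[query_base] & IUPAC_CODES[sequence_base])

print(compatible("R", "A"))  # True  -> a query "AAR" can count an "AAA" hit
print(compatible("R", "C"))  # False
```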

````diff
@@ -29,10 +31,10 @@ call:
 ```python
 from pattern_buffer import count_queries
 count_queries(sequences, queries, embedding)
-# tensor([[2, 0, 0, 2, 0],
-#         [2, 1, 0, 2, 0],
-#         [0, 1, 0, 0, 0],
-#         [0, 0, 0, 1, 0]])
+# tensor([[0, 0, 0, 0, 0],
+#         [3, 1, 0, 3, 0],
+#         [0, 1, 0, 0, 1],
+#         [1, 0, 0, 3, 0]])
 ```
 
 or create an instance of the `PatternBuffer` class, and use the `.count` method to count occurrences in new sequences. This has the advantage of not re-calculating the query embeddings or the `support` tensor each time, so is well suited for fast repeated counting:
````

````diff
@@ -41,20 +43,19 @@ or create an instance of the `PatternBuffer` class, and use the `.count` method
 from pattern_buffer import PatternBuffer
 pb = PatternBuffer(query_strings=queries, embedding=embedding)
 pb.count(sequences)
-# tensor([[2, 0, 0, 2, 0],
-#         [2, 1, 0, 2, 0],
-#         [0, 1, 0, 0, 0],
-#         [0, 0, 0, 1, 0]])
+# tensor([[0, 0, 0, 0, 0],
+#         [3, 1, 0, 3, 0],
+#         [0, 1, 0, 0, 1],
+#         [1, 0, 0, 3, 0]])
 ```
 
 ## Limitations
 
-- Currently, the program expects all input sequences to have the same length, but queries can already be different lengths.
 - If all of your patterns contain unique (non-ambiguous) characters then this encoding scheme is likely overkill and other software would be more efficient.
 - The software is designed for use with GPU acceleration, and will likely under-perform on CPU when compared to CPU-optimised counting schemes.
 
 ## Future work
 
-- [ ] Allow dynamic input lengths with padding
+- [x] Allow dynamic input lengths with padding
 - [ ] Automatic encoding detection from pattern analysis
 - [ ] FFT-based convolutions for long patterns
````
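
The README notes that constructing a `PatternBuffer` avoids re-calculating the query embeddings and the `support` tensor on every call. Below is a minimal sketch of that repeated-counting pattern, assuming only the API shown in the diff above (`generate_iupac_embedding`, `PatternBuffer(query_strings=..., embedding=...)`, and `.count`); the batches and the loop are illustrative, not part of the project.

```python
# A minimal sketch of repeated counting with the OO interface, assuming only
# the API shown in the README above (generate_iupac_embedding, PatternBuffer,
# and .count). The batch contents below are hypothetical.
from pattern_buffer import PatternBuffer, generate_iupac_embedding

queries = ["AAA", "AGT", "AAACA", "AAR", "GYGGA"]
embedding = generate_iupac_embedding()

# Construct once: the query embeddings and `support` tensor are built here,
# not on every call.
pb = PatternBuffer(query_strings=queries, embedding=embedding)

# Hypothetical batches, e.g. chunks read from a FASTA file or a DataLoader.
batches = [
    ["AACGAATCAAAAT", "AACAGTTCAAAAATTAGT"],
    ["AGTTCGYGGA", "AACAAGATCAGGAAAGCTGACTTGATG"],
]
for batch in batches:
    counts = pb.count(batch)  # one row per sequence, one column per query
    print(counts)
```

Building the buffer once and calling `.count` per batch is what makes the OO interface a better fit than `count_queries` for file parsing or DataLoader-style pipelines.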
