Commit f359daf

feat: switch to fastp for trimming (#155)
Switching to `fastp` should speed up adapter trimming and allows auto-detection of many adapter sequences. This pull request also moves from per-project adapter definitions to per-unit adapter definitions, which makes it possible to combine data from different sequencing experiments.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added support for configuring adapter trimming and extra options for sequencing reads using `fastp` via new schema fields.
* **Bug Fixes**
  * None.
* **Documentation**
  * Updated and expanded documentation to detail configuration of input files and new `fastp`-based trimming, replacing previous `cutadapt` instructions.
* **Refactor**
  * Switched trimming workflow from `cutadapt` to `fastp`, including renaming and restructuring rules and outputs.
  * Updated file and log output paths to consistently organize results within sample-specific directories.
  * Simplified configuration and schema files by removing all `cutadapt`-related parameters and sections.
  * Improved file path consistency across multiple workflow rules by nesting outputs within sample-specific subdirectories.
  * Changed quality control report categories from "QC" to "quality control" for clarity.
  * Updated GitHub Actions workflows to use newer action versions for improved CI/CD processes.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
1 parent 9964ecf commit f359daf

17 files changed: +269 -283 lines changed

.github/workflows/main.yml

Lines changed: 76 additions & 76 deletions
@@ -41,26 +41,26 @@ jobs:
  formatting:
  runs-on: ubuntu-latest
  steps:
- - uses: actions/checkout@v3
- with:
- fetch-depth: 0
- - name: Formatting
- uses: github/super-linter@v5
- env:
- VALIDATE_ALL_CODEBASE: false
- DEFAULT_BRANCH: main
- GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- VALIDATE_SNAKEMAKE_SNAKEFMT: true
+ - uses: actions/checkout@v4
+ with:
+ fetch-depth: 0
+ - name: Formatting
+ uses: github/super-linter@v7
+ env:
+ VALIDATE_ALL_CODEBASE: false
+ DEFAULT_BRANCH: main
+ GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+ VALIDATE_SNAKEMAKE_SNAKEFMT: true
  linting:
  runs-on: ubuntu-latest
  steps:
- - uses: actions/checkout@v3
- - name: Linting
- uses: snakemake/snakemake-github-action@v1
- with:
- directory: .test
- snakefile: workflow/Snakefile
- args: "--lint"
+ - uses: actions/checkout@v4
+ - name: Linting
+ uses: snakemake/snakemake-github-action@v2
+ with:
+ directory: .test
+ snakefile: workflow/Snakefile
+ args: "--lint"

  run-rna-workflow:
  runs-on: ubuntu-latest
@@ -69,72 +69,72 @@ jobs:
  - formatting
  steps:

- - name: Free Disk Space (Ubuntu)
- uses: jlumbroso/[email protected]
- with:
- # this might remove tools that are actually needed,
- # if set to "true" but frees about 6 GB
- tool-cache: false
-
- # all of these default to true, but feel free to set to
- # "false" if necessary for your workflow
- android: true
- dotnet: true
- haskell: true
- large-packages: true
- docker-images: false
- swap-storage: true
+ - name: Free Disk Space (Ubuntu)
+ uses: jlumbroso/[email protected]
+ with:
+ # this might remove tools that are actually needed,
+ # if set to "true" but frees about 6 GB
+ tool-cache: false

- - name: Checkout repository
- uses: actions/checkout@v3
- with:
- submodules: recursive
+ # all of these default to true, but feel free to set to
+ # "false" if necessary for your workflow
+ android: true
+ dotnet: true
+ haskell: true
+ large-packages: true
+ docker-images: false
+ swap-storage: true
+
+ - name: Checkout repository
+ uses: actions/checkout@v4
+ with:
+ submodules: recursive
+
+ - name: Test workflow
+ uses: snakemake/snakemake-github-action@v2
+ with:
+ directory: .test
+ snakefile: workflow/Snakefile
+ args: "--use-conda --show-failed-logs --cores all --conda-cleanup-pkgs cache --all-temp"

- - name: Test workflow
- uses: snakemake/snakemake-github-action@v1
- with:
- directory: .test
- snakefile: workflow/Snakefile
- args: "--use-conda --show-failed-logs --cores all --conda-cleanup-pkgs cache --all-temp"
-
  run-three-prime-rna-workflow:
  runs-on: ubuntu-latest
  needs:
  - linting
  - formatting
  steps:
-
- - name: Free Disk Space (Ubuntu)
- uses: jlumbroso/[email protected]
- with:
- # this might remove tools that are actually needed,
- # if set to "true" but frees about 6 GB
- tool-cache: false
-
- # all of these default to true, but feel free to set to
- # "false" if necessary for your workflow
- android: true
- dotnet: true
- haskell: true
- large-packages: true
- docker-images: false
- swap-storage: true

- - name: Checkout repository
- uses: actions/checkout@v3
+ - name: Free Disk Space (Ubuntu)
+ uses: jlumbroso/[email protected]
+ with:
+ # this might remove tools that are actually needed,
+ # if set to "true" but frees about 6 GB
+ tool-cache: false
+
+ # all of these default to true, but feel free to set to
+ # "false" if necessary for your workflow
+ android: true
+ dotnet: true
+ haskell: true
+ large-packages: true
+ docker-images: false
+ swap-storage: true
+
+ - name: Checkout repository
+ uses: actions/checkout@v4

- - name: Test 3-prime-workflow
- uses: snakemake/snakemake-github-action@v1
- with:
- directory: .test/three_prime
- snakefile: .test/three_prime/workflow/Snakefile
- args: "--use-conda --show-failed-logs --cores all --conda-cleanup-pkgs cache --all-temp"
- # Disable report testing for now since we mark all output files as temporary above.
- # TODO: add some kind of test mode to report generation which does not really try to include
- # results.
- # - name: Test report
- # uses: snakemake/snakemake-github-action@v1
- # with:
- # directory: .test
- # snakefile: workflow/Snakefile
- # args: "--report report.zip"
+ - name: Test 3-prime-workflow
+ uses: snakemake/snakemake-github-action@v2
+ with:
+ directory: .test/three_prime
+ snakefile: .test/three_prime/workflow/Snakefile
+ args: "--use-conda --show-failed-logs --cores all --conda-cleanup-pkgs cache --all-temp"
+ # Disable report testing for now since we mark all output files as temporary above.
+ # TODO: add some kind of test mode to report generation which does not really try to include
+ # results.
+ # - name: Test report
+ # uses: snakemake/snakemake-github-action@v1
+ # with:
+ # directory: .test
+ # snakefile: workflow/Snakefile
+ # args: "--report report.zip"

.test/config/config.yaml

Lines changed: 0 additions & 27 deletions
@@ -147,30 +147,3 @@ params:
  # If you want to decrease this for larger datasets, there paper and
  # [a reply on GitHub suggest a value of `-b 30`](https://github.com/pachterlab/kallisto/issues/353#issuecomment-1215742328).
  kallisto: "-b 30"
-
- # these cutadapt parameters need to contain the required flag(s) for
- # the type of adapter(s) to trim, i.e.:
- # * https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types
- # * `-a` for 3' adapter in the forward reads
- # * `-g` for 5' adapter in the forward reads
- # * `-b` for adapters anywhere in the forward reads
- # also, separate capitalised letter flags are required for adapters in
- # the reverse reads of paired end sequencing
- #
- # reasoning behind parameters:
- # * https://cutadapt.readthedocs.io/en/stable/guide.html#trimming-paired-end-reads
- # * `--minimum-length 33`:
- # * kallisto needs non-empty reads in current versions (fixed for future releases:
- # https://github.com/pachterlab/kallisto/commit/64fe837ca86f3664496483bcd2787c9376584fed)
- # * kallisto default k-mer length is 31 and 33 should give at least 3 k-mers for a read
- # * `-e 0.005`: the default cutadapt maximum error rate of `0.2` is far too high, for Illumina
- # data the error rate is more in the range of `0.005` and setting it accordingly should avoid
- # false positive adapter matches
- # * `--minimum-overlap 7`: the cutadapt default minimum overlap of `5` did trimming on the level
- # of expected adapter matches by chance
- cutadapt-se:
- adapters: "-a ACGGATCGATCGATCGATCGAT -g GGATCGATCGATCGATCGAT "
- extra: "--minimum-length 33 -e 0.005 --overlap 7"
- cutadapt-pe:
- adapters: "-a ACGGATCGATCGATCGATCGAT -g GGATCGATCGATCGATCGAT -A ACGGATCGATCGATCGATCGAT -G GGATCGATCGATCGATCGAT"
- extra: "--minimum-length 33 -e 0.005 --overlap 7"
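
The length reasoning in the removed comments can be checked with a little arithmetic. The following is a minimal sketch and not part of the commit: a read of length L contains L - k + 1 k-mers, so with a k-mer length of 31 a 33-base read yields 3 k-mers. The second helper assumes uniformly random, independent bases and exact matching (a simplification of real read data) to show how quickly chance adapter matches become rarer as the required overlap grows. Both function names are hypothetical.

```python
def kmer_count(read_length: int, k: int = 31) -> int:
    """Number of k-mers contained in a read; 0 if the read is shorter than k."""
    return max(0, read_length - k + 1)


def chance_match_probability(overlap: int) -> float:
    """Probability that `overlap` random bases exactly match a fixed sequence.

    Assumes uniform, independent bases, which is an idealisation.
    """
    return 0.25 ** overlap


if __name__ == "__main__":
    # A 33-base read with k = 31 gives 33 - 31 + 1 k-mers.
    print(kmer_count(33))
    # Requiring 7 instead of 5 matching bases makes a chance match 16x less likely.
    print(chance_match_probability(5) / chance_match_probability(7))
```

Under these assumptions, a minimum read length of 33 is the smallest value that leaves at least three k-mers per read for a k-mer length of 31.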

.test/config/units.tsv

Lines changed: 6 additions & 6 deletions
@@ -1,6 +1,6 @@
- sample unit fragment_len_mean fragment_len_sd fq1 fq2
- A 1 ngs-test-data/reads/a.chr21.1.fq ngs-test-data/reads/a.chr21.2.fq
- B 1 ngs-test-data/reads/b.chr21.1.fq ngs-test-data/reads/b.chr21.2.fq
- B 2 300 14 ngs-test-data/reads/b.chr21.1.fq
- C 1 ngs-test-data/reads/a.chr21.1.fq ngs-test-data/reads/a.chr21.2.fq
- D 1 ngs-test-data/reads/b.chr21.1.fq ngs-test-data/reads/b.chr21.2.fq
+ sample unit fragment_len_mean fragment_len_sd fq1 fq2 fastp_adapters fastp_extra
+ A 1 ngs-test-data/reads/a.chr21.1.fq ngs-test-data/reads/a.chr21.2.fq
+ B 1 ngs-test-data/reads/b.chr21.1.fq ngs-test-data/reads/b.chr21.2.fq
+ B 2 300 14 ngs-test-data/reads/b.chr21.1.fq
+ C 1 ngs-test-data/reads/a.chr21.1.fq ngs-test-data/reads/a.chr21.2.fq
+ D 1 ngs-test-data/reads/b.chr21.1.fq ngs-test-data/reads/b.chr21.2.fq
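
The new `fastp_adapters` and `fastp_extra` columns are left empty in this test sheet. As a rough illustration of how per-unit values could be turned into command-line arguments, here is a minimal Python sketch. It is not the workflow's actual rule code: the sample rows, the `fastp_args` helper, and the fallback to `--detect_adapter_for_pe` for paired-end units without explicit adapters are assumptions made for illustration. The flags `--adapter_sequence`, `--length_required` and `--detect_adapter_for_pe` are existing fastp options.

```python
import csv
import io

# Hypothetical units sheet with the two new columns (tab-separated).
UNITS_TSV = (
    "sample\tunit\tfq1\tfq2\tfastp_adapters\tfastp_extra\n"
    "A\t1\ta.1.fq\ta.2.fq\t\t\n"
    "B\t2\tb.1.fq\t\t--adapter_sequence ACGGATCGATCGATCGATCGAT\t--length_required 33\n"
)


def fastp_args(row: dict) -> str:
    """Build a fastp argument string for one unit (illustrative helper)."""
    adapters = (row.get("fastp_adapters") or "").strip()
    extra = (row.get("fastp_extra") or "").strip()
    paired = bool((row.get("fq2") or "").strip())
    # Assumption: paired-end units without explicit adapters ask fastp
    # to auto-detect them.
    if not adapters and paired:
        adapters = "--detect_adapter_for_pe"
    return " ".join(part for part in (adapters, extra) if part)


rows = list(csv.DictReader(io.StringIO(UNITS_TSV), delimiter="\t"))
args_by_unit = {(r["sample"], r["unit"]): fastp_args(r) for r in rows}

if __name__ == "__main__":
    for unit, args in args_by_unit.items():
        print(unit, args)
```

Keeping the adapter settings in the units sheet rather than in a project-wide config section is what lets units from different sequencing experiments carry different trimming settings in the same run.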

.test/three_prime/config/config.yaml

Lines changed: 0 additions & 30 deletions
@@ -141,33 +141,3 @@ params:
  # If you want to decrease this for larger datasets, there paper and
  # [a reply on GitHub suggest a value of `-b 30`](https://github.com/pachterlab/kallisto/issues/353#issuecomment-1215742328).
  kallisto: "-b 30"
-
- # these cutadapt parameters need to contain the required flag(s) for
- # the type of adapter(s) to trim, i.e.:
- # * https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types
- # * `-a` for 3' adapter in the forward reads
- # * `-g` for 5' adapter in the forward reads
- # * `-b` for adapters anywhere in the forward reads
- # also, separate capitalised letter flags are required for adapters in
- # the reverse reads of paired end sequencing
- #
- # reasoning behind parameters:
- # * https://cutadapt.readthedocs.io/en/stable/guide.html#trimming-paired-end-reads
- # * `--minimum-length 33`:
- # * kallisto needs non-empty reads in current versions (fixed for future releases:
- # https://github.com/pachterlab/kallisto/commit/64fe837ca86f3664496483bcd2787c9376584fed)
- # * kallisto default k-mer length is 31 and 33 should give at least 3 k-mers for a read
- # * `-e 0.005`: the default cutadapt maximum error rate of `0.2` is far too high, for Illumina
- # data the error rate is more in the range of `0.005` and setting it accordingly should avoid
- # false positive adapter matches
- # * `--minimum-overlap 7`: the cutadapt default minimum overlap of `5` did trimming on the level
- # of expected adapter matches by chance
- cutadapt-se:
- # This setup is for Lexogen QuantSeq FWD data, based on (but simplfied):
- # https://faqs.lexogen.com/faq/what-is-the-adapter-sequence-i-need-to-use-for-t-1
- # For more details, see the Lexogen 3' QuantSeq section in the `config/README.md` file.
- adapters: "-a 'r1adapter=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC;min_overlap=7;max_error_rate=0.005'"
- extra: "--minimum-length 33 --nextseq-trim=20 --poly-a"
- cutadapt-pe:
- adapters: ""
- extra: ""

.test/three_prime/config/units.tsv

Lines changed: 7 additions & 7 deletions
@@ -1,7 +1,7 @@
- sample unit fragment_len_mean fragment_len_sd fq1 fq2 bam_single bam_paired
- SRR8309096 u1 430 43 quant_seq_test_data/SRR8309096.fastq.gz
- SRR8309094 u1 430 43 quant_seq_test_data/SRR8309094.fastq.gz
- SRR8309095 u1 430 43 quant_seq_test_data/SRR8309095.fastq.gz
- SRR8309097 u1 430 43 quant_seq_test_data/SRR8309097.fastq.gz
- SRR8309098 u1 430 43 quant_seq_test_data/SRR8309098.fastq.gz
- SRR8309099 u1 430 43 quant_seq_test_data/SRR8309099.fastq.gz
+ sample unit fragment_len_mean fragment_len_sd fq1 fq2 bam_single bam_paired fastp_adapters fastp_extra
+ SRR8309096 u1 430 43 quant_seq_test_data/SRR8309096.fastq.gz
+ SRR8309094 u1 430 43 quant_seq_test_data/SRR8309094.fastq.gz
+ SRR8309095 u1 430 43 quant_seq_test_data/SRR8309095.fastq.gz
+ SRR8309097 u1 430 43 quant_seq_test_data/SRR8309097.fastq.gz
+ SRR8309098 u1 430 43 quant_seq_test_data/SRR8309098.fastq.gz
+ SRR8309099 u1 430 43 quant_seq_test_data/SRR8309099.fastq.gz
