Description
Bug report
I'm trying to parse the output of the mummer tool "show-coords" (tabular format) using splitCsv.
splitCsv puts an element in its output channel that reflects the header row as a data row, regardless of the setting of the "skip" parameter.
Expected behavior and actual behavior
Providing a header and the correct number of lines to skip at the beginning of the file (or CSV string) should output one element per row of data, and the header row from the file/string should be skipped. This does not happen: the header row is treated as a row of data.
Steps to reproduce the problem
This code snippet demonstrates the issue. The string provided is a representation of a mummer show-coords tabular output file. I have also tested this with a file input and the result is the same. Note that the show-coords format has an empty line before the header row ("[S1]...."); this empty line appears to be the cause of the issue.
The command should skip the first 4 lines (including the header), then apply the header provided as a parameter to splitCsv, and return a channel with 3 items. But, as you can see in the output below, it produces 4 items, the first of which treats the header as a data row.
channel.of(
    ["file1.fa file2.fa\nNUCMER\n\n[S1] [E1] [S2] [E2] [LEN 1] [LEN 2] [% IDY] [LEN R] [LEN Q] [COV R] [COV Q] [TAGS]\n12 522 1 511 511 511 100.00 2341 511 21.83 100.00 seq1 seq2\n421 491 165 95 71 71 98.59 2341 215 3.03 33.02 seq3 seq4\n470 574 1 105 105 105 100.00 2341 105 4.49 100.00 seq5 seq6"]
)
.splitCsv(
    skip: 4,
    header: ['S1','E1','S2','E2','LEN 1','LEN 2','% IDY','LEN R','LEN Q','COV R','COV Q','TAG1','TAG2'],
    sep: "\t"
).view()
Program output
[[S1:[S1], E1:[E1], S2:[S2], E2:[E2], LEN 1:[LEN 1], LEN 2:[LEN 2], % IDY:[% IDY], LEN R:[LEN R], LEN Q:[LEN Q], COV R:[COV R], COV Q:[COV Q], TAG1:[TAGS], TAG2:null]]
[[S1:12, E1:522, S2:1, E2:511, LEN 1:511, LEN 2:511, % IDY:100.00, LEN R:2341, LEN Q:511, COV R:21.83, COV Q:100.00, TAG1:seq1, TAG2:seq2]]
[[S1:421, E1:491, S2:165, E2:95, LEN 1:71, LEN 2:71, % IDY:98.59, LEN R:2341, LEN Q:215, COV R:3.03, COV Q:33.02, TAG1:seq3, TAG2:seq4]]
[[S1:470, E1:574, S2:1, E2:105, LEN 1:105, LEN 2:105, % IDY:100.00, LEN R:2341, LEN Q:105, COV R:4.49, COV Q:100.00, TAG1:seq5, TAG2:seq6]]
Demonstrating the presumed cause
The empty line seems to be the problem. This modified version of the CSV data replaces the empty line 3 with some bogus text. We have the same number of lines, but now the output is correct:
channel.of(
    ["file1.fa file2.fa\nNUCMER\nMAKE THIS LINE NOT EMPTY\n[S1] [E1] [S2] [E2] [LEN 1] [LEN 2] [% IDY] [LEN R] [LEN Q] [COV R] [COV Q] [TAGS]\n12 522 1 511 511 511 100.00 2341 511 21.83 100.00 seq1 seq2\n421 491 165 95 71 71 98.59 2341 215 3.03 33.02 seq3 seq4\n470 574 1 105 105 105 100.00 2341 105 4.49 100.00 seq5 seq6"]
)
.splitCsv(
    skip: 4,
    header: ['S1','E1','S2','E2','LEN 1','LEN 2','% IDY','LEN R','LEN Q','COV R','COV Q','TAG1','TAG2'],
    sep: "\t"
).view()
Program output
[[S1:12, E1:522, S2:1, E2:511, LEN 1:511, LEN 2:511, % IDY:100.00, LEN R:2341, LEN Q:511, COV R:21.83, COV Q:100.00, TAG1:seq1, TAG2:seq2]]
[[S1:421, E1:491, S2:165, E2:95, LEN 1:71, LEN 2:71, % IDY:98.59, LEN R:2341, LEN Q:215, COV R:3.03, COV Q:33.02, TAG1:seq3, TAG2:seq4]]
[[S1:470, E1:574, S2:1, E2:105, LEN 1:105, LEN 2:105, % IDY:100.00, LEN R:2341, LEN Q:105, COV R:4.49, COV Q:100.00, TAG1:seq5, TAG2:seq6]]
By contrast, changing the skip to 3 (while keeping the empty line in the CSV input) has no effect: it still outputs 4 items, including the one that treats the header as data.
Why not just use skip: 3, header: true?
Letting splitCsv figure out the header itself does not work here because, unfortunately, the mummer show-coords tabular output format uses a tab within the last data column (TAGS). Since there is no matching header column for that extra field, its data is not captured, and that is exactly the data item I need to capture.
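For reference, here is a minimal sketch of that skip: 3, header: true variant, reusing the same input string as above. It is purely illustrative of the point just made, not a verified workaround:
channel.of(
    ["file1.fa file2.fa\nNUCMER\n\n[S1] [E1] [S2] [E2] [LEN 1] [LEN 2] [% IDY] [LEN R] [LEN Q] [COV R] [COV Q] [TAGS]\n12 522 1 511 511 511 100.00 2341 511 21.83 100.00 seq1 seq2\n421 491 165 95 71 71 98.59 2341 215 3.03 33.02 seq3 seq4\n470 574 1 105 105 105 100.00 2341 105 4.49 100.00 seq5 seq6"]
)
.splitCsv(
    skip: 3,        // intended to skip the two preamble lines and the empty line
    header: true,   // take the column names from the "[S1] [E1] ..." line
    sep: "\t"
).view()
// The TAGS column contains a tab between the two sequence names, so each data row has one
// more field than the header row; the trailing field (seq2, seq4, seq6) has no matching
// header column and is not captured.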
Environment
- Nextflow version: 25.04.6
- Java version: OpenJDK Runtime Environment (build 17.0.15+6-Ubuntu-0ubuntu120.04)
- Operating system: Linux
- Bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
Additional context
As pointed out above, I originally noticed this behaviour when reading from a CSV file. I am providing a CSV text string here, but the same error happens with the original file.
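For completeness, a minimal sketch of how the file-based variant looks; show_coords.tsv is a placeholder name standing in for the real show-coords output file:
channel
    .fromPath('show_coords.tsv')   // placeholder path for the show-coords tabular output
    .splitCsv(
        skip: 4,
        header: ['S1','E1','S2','E2','LEN 1','LEN 2','% IDY','LEN R','LEN Q','COV R','COV Q','TAG1','TAG2'],
        sep: "\t"
    )
    .view()
// With the empty line before the "[S1] ..." header present in the file, this also yields
// 4 items, the first of which treats the header line as data.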