splitCsv with skip parameter does not work as expected when file contains a blank line #6422

@fschwach

Bug report

I'm trying to parse the tabular output of the mummer tool "show-coords" using splitCsv.
splitCsv emits an element in its output channel that contains the header row as data, regardless of the value of the "skip" parameter.

Expected behavior and actual behavior

Providing a header and the correct number of lines to skip at the beginning of the file (or csv string) should output one element per row of data, and the header row from the file/string should be skipped. This does not happen: the header row is treated as a row of data.

Steps to reproduce the problem

This code snippet demonstrates the issue. The string provided is a representation of a mummer show-coords tabular output file. I have also tested this with a file input and the result is the same. Note that the show-coords format has an empty line before the header row ("[S1]...."). This appears to be the issue.
The command should skip the first 4 lines (including the header), then apply the header provided as a parameter to splitCsv and return a channel with 3 items. But, as you can see in the output below, it produces 4 items, and the first one treats the header as a data row.

channel.of(
        ["file1.fa	file2.fa\nNUCMER\n\n[S1]	[E1]	[S2]	[E2]	[LEN 1]	[LEN 2]	[% IDY]	[LEN R]	[LEN Q]	[COV R]	[COV Q]	[TAGS]\n12	522	1	511	511	511	100.00	2341	511	21.83	100.00	seq1	seq2\n421	491	165	95	71	71	98.59	2341	215	3.03	33.02	seq3	seq4\n470	574	1	105	105	105	100.00	2341	105	4.49	100.00	seq5	seq6"]
    )
    .splitCsv( 
            skip: 4,
            header: ['S1','E1','S2','E2','LEN 1','LEN 2','% IDY','LEN R','LEN Q','COV R','COV Q','TAG1','TAG2'], 
            sep: "\t"
            ).view()

Program output

[[S1:[S1], E1:[E1], S2:[S2], E2:[E2], LEN 1:[LEN 1], LEN 2:[LEN 2], % IDY:[% IDY], LEN R:[LEN R], LEN Q:[LEN Q], COV R:[COV R], COV Q:[COV Q], TAG1:[TAGS], TAG2:null]]
[[S1:12, E1:522, S2:1, E2:511, LEN 1:511, LEN 2:511, % IDY:100.00, LEN R:2341, LEN Q:511, COV R:21.83, COV Q:100.00, TAG1:seq1, TAG2:seq2]]
[[S1:421, E1:491, S2:165, E2:95, LEN 1:71, LEN 2:71, % IDY:98.59, LEN R:2341, LEN Q:215, COV R:3.03, COV Q:33.02, TAG1:seq3, TAG2:seq4]]
[[S1:470, E1:574, S2:1, E2:105, LEN 1:105, LEN 2:105, % IDY:100.00, LEN R:2341, LEN Q:105, COV R:4.49, COV Q:100.00, TAG1:seq5, TAG2:seq6]]

Demonstrating the presumed cause

It seems the empty line is the problem. This modified version of the csv data replaces the empty line 3 with some bogus text. We have the same number of lines but now the output is correct:

channel.of(
        ["file1.fa	file2.fa\nNUCMER\nMAKE THIS LINE NOT EMPTY\n[S1]	[E1]	[S2]	[E2]	[LEN 1]	[LEN 2]	[% IDY]	[LEN R]	[LEN Q]	[COV R]	[COV Q]	[TAGS]\n12	522	1	511	511	511	100.00	2341	511	21.83	100.00	seq1	seq2\n421	491	165	95	71	71	98.59	2341	215	3.03	33.02	seq3	seq4\n470	574	1	105	105	105	100.00	2341	105	4.49	100.00	seq5	seq6"]
    )
    .splitCsv( 
            skip: 4,
            header: ['S1','E1','S2','E2','LEN 1','LEN 2','% IDY','LEN R','LEN Q','COV R','COV Q','TAG1','TAG2'], 
            sep: "\t"
            ).view()

output:

[[S1:12, E1:522, S2:1, E2:511, LEN 1:511, LEN 2:511, % IDY:100.00, LEN R:2341, LEN Q:511, COV R:21.83, COV Q:100.00, TAG1:seq1, TAG2:seq2]]
[[S1:421, E1:491, S2:165, E2:95, LEN 1:71, LEN 2:71, % IDY:98.59, LEN R:2341, LEN Q:215, COV R:3.03, COV Q:33.02, TAG1:seq3, TAG2:seq4]]
[[S1:470, E1:574, S2:1, E2:105, LEN 1:105, LEN 2:105, % IDY:100.00, LEN R:2341, LEN Q:105, COV R:4.49, COV Q:100.00, TAG1:seq5, TAG2:seq6]]

Whereas changing the skip to "3" (while keeping the empty line in the csv input) has no effect and still outputs 4 items, including the one that treats the header as data.

Why not just use skip: 3, header: true to let splitCsv figure out the header? Because, unfortunately, the mummer show-coords tabular output format puts two tab-separated values in the last data column (TAGS). Since the header row has only one name for them, the second value has no matching header and is not captured. This is exactly the data item I need to capture.
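
A possible workaround, sketched here as an assumption and not as a confirmed fix, is to bypass splitCsv and split the text manually in plain Groovy, so that the blank line counts toward the number of skipped lines. The helper name parseCoords is hypothetical and not part of Nextflow. The sample text is the same as in the examples above, written with \t escapes.

```groovy
// Column names, with the two TAGS values named separately
def cols = ['S1','E1','S2','E2','LEN 1','LEN 2','% IDY','LEN R','LEN Q','COV R','COV Q','TAG1','TAG2']

// Hypothetical helper (not a Nextflow API): parse show-coords text without splitCsv
def parseCoords = { String text, int skip ->
    text.split('\n')
        .drop(skip)                  // blank lines are counted here
        .findAll { it.trim() }       // ignore blank lines among the data rows
        .collect { line ->
            def fields = line.split('\t') as List
            // pair each column name with its value; extra items on either side are dropped
            [cols, fields].transpose().collectEntries()
        }
}

def text = [
    'file1.fa\tfile2.fa',
    'NUCMER',
    '',
    '[S1]\t[E1]\t[S2]\t[E2]\t[LEN 1]\t[LEN 2]\t[% IDY]\t[LEN R]\t[LEN Q]\t[COV R]\t[COV Q]\t[TAGS]',
    '12\t522\t1\t511\t511\t511\t100.00\t2341\t511\t21.83\t100.00\tseq1\tseq2',
    '421\t491\t165\t95\t71\t71\t98.59\t2341\t215\t3.03\t33.02\tseq3\tseq4',
    '470\t574\t1\t105\t105\t105\t100.00\t2341\t105\t4.49\t100.00\tseq5\tseq6'
].join('\n')

def rows = parseCoords(text, 4)
rows.each { println it }
```

In a workflow, the same closure could presumably be applied with channel.of(text).flatMap { parseCoords(it, 4) }.view(), but I have only reasoned about the plain Groovy part above.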

Environment

  • Nextflow version: 25.04.6
  • Java version: OpenJDK Runtime Environment (build 17.0.15+6-Ubuntu-0ubuntu120.04)
  • Operating system: Linux
  • Bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)

Additional context

As pointed out above, I originally noticed this behaviour when reading from a csv file. I am providing a csv text string here, but the same error happens with the original file.
