Best practices for multiple process inputs #4311

@bentsherman

Is your feature request related to a problem? Please describe

Howdy folks

I'd like to discuss a common convention I see with nf-core modules, where a process has two separate inputs for e.g. a sample and an index. Here are a few examples I found:

So you have two inputs:

    input:
    tuple val(meta), path(reads)
    tuple val(meta2), path(index)

This convention works fine as long as you have a single index, in which case you can provide the index as a value channel and it will be "broadcast" to every sample, basically an implicit cross product.
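To make the single-index case concrete, here is a minimal sketch of the broadcast behaviour. It is a toy, not taken from any nf-core module: the process body, the sample and index names, and the use of `val` in place of `path` (so that no real files are needed) are all assumptions for illustration.

```nextflow
// Toy process: val(reads) and val(index) stand in for path(reads) and path(index)
// so the sketch does not depend on files existing on disk.
process PROC {
    input:
    tuple val(meta), val(reads)
    tuple val(meta2), val(index)

    output:
    stdout

    script:
    """
    echo "${meta.id} processed with ${meta2.id}"
    """
}

workflow {
    // Queue channel: one item per sample (hypothetical names)
    ch_samples = Channel.of(
        [ [id: 'sampleA'], 'a.fastq' ],
        [ [id: 'sampleB'], 'b.fastq' ]
    )

    // Value channel: a single item that can be read any number of times,
    // so it is paired with every sample
    ch_index = Channel.value( [ [id: 'genome1'], 'genome1.idx' ] )

    PROC(ch_samples, ch_index)
    PROC.out.view()
}
```

With these inputs the process should run once per sample, each time with the same index tuple.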

But what if you have multiple indices? The process inputs are not really set up to handle this, so you have to do a bit of hacking:

ch_samples = Channel.of( /* ... */ )
ch_indices = Channel.of( /* ... */ )

// combine emits flat tuples: [meta, reads, meta2, index]
ch_inputs = ch_samples.combine(ch_indices)
ch_multi = ch_inputs.multiMap { it ->
    samples: it[0..1]   // [meta, reads]
    indices: it[2..3]   // [meta2, index]
}

PROC(ch_multi.samples, ch_multi.indices)

But now you're wondering if multiMap preserves the order of its inputs, and that question leads down a deep rabbit hole. I have now led multiple people through that rabbit hole, and every time it leads me back to the original problem of multiple inputs. It's the reason why I added this note to the docs.

It's not always nf-core modules that are the cause; more generally it's "someone else's process that I'm trying to re-use". In any case, I'm hoping to broach the subject and spread this best practice to the community. Are people aware of this issue? Have you debated this convention in the past? If so, I would prefer to build on whatever previous discussions were had.

By the way, here's how I think you SHOULD do it:

process PROC {
    input:
    tuple val(meta), path(reads), val(meta2), path(index)

    // ...
}

workflow {
    ch_samples = Channel.of( /* ... */ )
    ch_indices = Channel.of( /* ... */ )

    PROC( ch_samples.combine(ch_indices) )
}

Easy! It works in all cases (one-to-one, many-to-one, many-to-many), and it doesn't require you to play fast and loose with your dataflow.
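For anyone unsure what the combined channel looks like, here is a small sketch of the many-to-many case. The sample and index names are made up for illustration, and plain strings stand in for real file paths.

```nextflow
workflow {
    // Hypothetical inputs: two samples and two indices, each a [meta, value] pair
    ch_samples = Channel.of(
        [ [id: 'sampleA'], 'a.fastq' ],
        [ [id: 'sampleB'], 'b.fastq' ]
    )
    ch_indices = Channel.of(
        [ [id: 'genome1'], 'genome1.idx' ],
        [ [id: 'genome2'], 'genome2.idx' ]
    )

    // combine emits one flat 4-element tuple per (sample, index) pair,
    // which matches: tuple val(meta), path(reads), val(meta2), path(index)
    ch_samples
        .combine(ch_indices)
        .map { meta, reads, meta2, index -> "${meta.id} x ${meta2.id}: ${reads}, ${index}" }
        .view()
}
```

This should print four lines, one per sample/index pair. Because each pair travels as a single tuple, the sample and its index cannot drift apart, whatever order the items are emitted in.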
