Improve reference data practices #7

@dleehr

Our workflows often rely on reference datasets that we mount into the VM from a private NFS server: essentially, every entry in this file with a /data/ prefix. For example:

      "path": "/data/exome-seq/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar"
      "path": "/data/exome-seq/b37/Mills_and_1000G_gold_standard.indels.b37.vcf"
        "path": "/data/exome-seq/capture/xgen-exome-research-panel-targetsae255a1532796e2eaa53ff00001c1b3c-trimmed-chr.bed"
        "path": "/data/exome-seq/b37/dbsnp_138.b37.vcf"
        "path": "/data/exome-seq/b37/Mills_and_1000G_gold_standard.indels.b37.vcf"
        "path": "/data/exome-seq/b37/1000G_phase1.indels.b37.vcf"
        "path": "/data/exome-seq/capture/xgen-exome-research-panel-probesbe255a1532796e2eaa53ff00001c1b3c-trimmed-chr.bed"
      "path": "/data/exome-seq/b37/decoy/human_g1k_v37_decoy.fasta"
      "path": "/data/exome-seq/b37/dbsnp_138.b37.vcf"
      "path": "/data/exome-seq/b37/1000G_phase1.snps.high_confidence.b37.vcf"
      "path": "/data/exome-seq/b37/hapmap/hapmap_3.3.b37.vcf"
      "path": "/data/exome-seq/b37/omni/1000G_omni2.5.b37.vcf"

While some of the referenced datasets may seem obvious to those with domain expertise, their provenance is not made explicit. We also do not publish checksums or file sizes for these files, and we do not provide access to them.

Let's come up with a strategy to address these shortcomings.
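
As one possible starting point, here is a minimal sketch of a manifest that records the size, a SHA-256 checksum, and an optional provenance note for each reference file. The function names (`sha256sum`, `build_manifest`, `verify_manifest`), the manifest fields, and the `source_urls` mapping are all hypothetical and not part of the existing workflows; the choice of SHA-256 is likewise an assumption.

```python
import hashlib
import os


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths, source_urls=None):
    """Describe each unique path with its size, checksum, and provenance.

    source_urls is a hypothetical mapping from path to the place the file
    was originally obtained; entries without one get None.
    """
    source_urls = source_urls or {}
    manifest = []
    for path in sorted(set(paths)):  # the same file may be listed twice
        manifest.append({
            "path": path,
            "size": os.path.getsize(path),
            "sha256": sha256sum(path),
            "source": source_urls.get(path),
        })
    return manifest


def verify_manifest(manifest):
    """Return (path, problem) pairs for files that do not match the manifest."""
    problems = []
    for entry in manifest:
        path = entry["path"]
        if not os.path.isfile(path):
            problems.append((path, "missing"))
        elif os.path.getsize(path) != entry["size"]:
            problems.append((path, "size mismatch"))
        elif sha256sum(path) != entry["sha256"]:
            problems.append((path, "checksum mismatch"))
    return problems
```

Running `build_manifest` once over the `/data/` paths and committing the result (for example as JSON) would make sizes and checksums explicit, and `verify_manifest` could then be run on any VM before a workflow starts. It does not solve the access problem on its own; that still needs a decision about where the files can be hosted.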
