WDL best practice guidelines

Input and output files are the central construct for orchestrating high-performance, data-parallel analysis workflows on the underlying computing and storage infrastructure. When you write WDL workflow scripts, follow the guidelines and best practices below, particularly those for File-typed input and output variables. They enable scalable, workload-driven process automation and production on the SeqsLab platform, reduce unnecessary complexity in workflow design, and help ensure the integrity of workflow execution.

Input and output files

Declare every variable that represents an input or output file or directory with the File type. This allows SeqsLab to automatically identify the workflow and task variables that link datasets between the local filesystem and cloud storage, and to apply dynamic data-parallel processing automation.
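
As a minimal sketch of this guideline, the task below declares both its input and its output as File. The task name, variable names, and the samtools command are illustrative assumptions, not part of the SeqsLab documentation.

```wdl
version 1.0

# Hypothetical task: every file that enters or leaves the task is File-typed.
task SortBam {
  input {
    File inputBam          # input file, declared as File rather than String
    String outputName = "sorted.bam"
  }

  command <<<
    set -e -o pipefail
    samtools sort -o "~{outputName}" "~{inputBam}"
  >>>

  output {
    File sortedBam = "~{outputName}"   # output file, also declared as File
  }
}
```

Declaring inputBam as a String path instead would hide the file from any tooling that inspects File-typed variables, which is what the guideline above is meant to prevent.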

Compound types with File-typed variables

Use the Array compound type to represent a list of input files or nested input files, e.g., Array[File] or Array[Array[File]]. The SeqsLab workflow execution service does not currently support File-typed variables declared in other compound types, such as Map, Pair, Object, and Struct.
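
The declarations below sketch the supported and unsupported forms described above. The workflow and variable names are hypothetical, and the commented-out lines only show the kind of declaration to avoid.

```wdl
version 1.0

workflow CompoundFileInputs {
  input {
    # A flat list of files, for example one BAM file per sample.
    Array[File] inputBams

    # A nested list, for example one [R1, R2] FASTQ pair per sample.
    Array[Array[File]] inputFastqs

    # Avoid: File values inside other compound types, which the text
    # above describes as not currently supported.
    # Map[String, File] bamBySample
    # Pair[File, File] fastqPair
  }
}
```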

Command standard output and error

SeqsLab automatically logs everything that commands write to stdout and stderr, and stores the logs in the configured storage infrastructure (e.g., your Azure Blob Storage account) for audit trail purposes. You therefore do not need to declare them explicitly as task outputs. For workflow integrity and reproducibility, we recommend that your WDL workflow not rely on the standard output or error messages of a command.
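
One way to follow this recommendation, sketched here under assumed names, is to redirect a result into a named file and declare that file as the output instead of reading it back with the WDL stdout() function. The task name and file name are hypothetical.

```wdl
version 1.0

# Hypothetical task: the read count is written to a named file,
# so the workflow does not depend on the command's standard output.
task CountReads {
  input {
    File inputBam
  }

  command <<<
    set -e -o pipefail
    samtools view -c "~{inputBam}" > read_count.txt
  >>>

  output {
    # Preferred: a file the command wrote explicitly.
    File readCount = "read_count.txt"
    # Avoided here: File readCount = stdout()
  }
}
```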

Nested scatter calls

Because SeqsLab provides implicit and dynamic data-parallel processing (file data partitioning), you do not have to parallelize workloads yourself in WDL scripts. Hand-written scatter calls for data partitioning become unnecessary, which improves both readability and performance. We recommend that you use a scatter block only to run the same subworkflow or task on multiple input files (e.g., FASTQ or BAM files) in parallel, and that you avoid nested scatter blocks (i.e., a scatter block inside another scatter block) in the same workflow.

The following example shows the pattern to avoid: a scatter block that partitions a BAM file is nested inside a scatter block that parallelizes the processing of multiple FASTQ input files.

scatter (idx in range(length(inputFastqs))) {
  Array[File] reads = inputFastqs[idx]
  
  call BwaMem {
    input:
      readR1 = reads[0],
      readR2 = reads[1],
      refFa = refFa
  }

  call PartitionBam {
    input:
      partitionBed = WgsCallableRegions,
      inputBam = BwaMem.outputBam
  }
  
  scatter (inputBam in PartitionBam.outputBam) {
    call VariantCalling {
      input:
        inputBam = inputBam,
        refFa = refFa
    }
  }
  
  call MergeVcfs {
    input:
      inputVcfs = VariantCalling.outputVcf
  }
}

With SeqsLab's implicit and dynamic data-parallel processing, you can drop the hand-written scatter-gather steps: the partitioning, the inner scatter block, and the merge are no longer needed to improve overall workflow performance. The following example shows the simplified workflow.

scatter (idx in range(length(inputFastqs))) {
  Array[File] reads = inputFastqs[idx]
  
  call BwaMem {
    input:
      readR1 = reads[0],
      readR2 = reads[1],
      refFa = refFa
  }
  
  call VariantCalling {
    input:
      inputBam = BwaMem.outputBam,
      refFa = refFa
  }
}

Output files

It is good practice to explicitly define which files to output in the output section of each WDL task. SeqsLab automatically collects those output files and saves them in the underlying storage infrastructure. If you need to output an entire directory, archive it into a single file and declare that archive as the output, as in the following example. Note, however, that outputting whole directories is not recommended.

command <<<
  myprogram --outdir "~{outputDir}" "~{inputBam}"
  tar -zcf "~{outputDir}.tgz" "~{outputDir}"
>>>

output {
  File outputArchive = "~{outputDir}.tgz"
}
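
By contrast, the recommended practice of naming each output file explicitly might look like the sketch below. The task, the sampleName input, and the bcftools commands are illustrative assumptions, and the sketch assumes bcftools can read or create the FASTA index for refFa.

```wdl
version 1.0

# Hypothetical task: each output file is declared individually
# instead of archiving a whole directory.
task CallVariants {
  input {
    File inputBam
    File refFa
    String sampleName
  }

  command <<<
    set -e -o pipefail
    bcftools mpileup -f "~{refFa}" "~{inputBam}" \
      | bcftools call -mv -Oz -o "~{sampleName}.vcf.gz"
    bcftools index -t "~{sampleName}.vcf.gz"
  >>>

  output {
    File outputVcf = "~{sampleName}.vcf.gz"
    File outputVcfIndex = "~{sampleName}.vcf.gz.tbi"
  }
}
```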

Command section style

Use triple angle brackets <<< … >>> to enclose the command section, and write expression placeholders as ~{…} so that they cannot be confused with the Bash expression syntax ${}. For example, a command might reference an input to the task in this way:

command <<<
  set -e -o pipefail
  samtools index "~{processedBam}"
>>>
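
To illustrate why the two syntaxes are kept apart, the sketch below mixes a WDL placeholder with a Bash variable in one command section. The task name and the TMP_BAM variable are hypothetical.

```wdl
version 1.0

task IndexLocalCopy {
  input {
    File processedBam
  }

  command <<<
    set -e -o pipefail
    # ~{processedBam} is a WDL placeholder, filled in before the shell runs.
    TMP_BAM="$(basename "~{processedBam}")"
    # ${TMP_BAM} is a Bash variable, expanded by the shell at run time.
    cp "~{processedBam}" "${TMP_BAM}"
    samtools index "${TMP_BAM}"
  >>>

  output {
    File outputBai = "~{basename(processedBam)}.bai"
  }
}
```

Inside <<< … >>>, only ~{…} is evaluated by WDL, so ${TMP_BAM} reaches the shell unchanged.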

Import WDL scripts

The SeqsLab platform provides GA4GH (Global Alliance for Genomics and Health) Tool Registry Service APIs to manage your workflow assets (e.g., WDL scripts, input JSON files, and container images) and to meet production requirements for regulatory compliance, code integrity, and version control. When you import other WDL files that are registered under the same Tool Version, always use paths relative to the location of the main WDL file.

In the following example, the main workflow MyMainWorkflow.wdl imports Alignment.wdl from the SubWorkflow directory, which is located in the same directory as the main workflow file.

/root/MyMainWorkflow.wdl
/root/SubWorkflow/Alignment.wdl

version 1.0
import "SubWorkflow/Alignment.wdl" as WfAlign

workflow MyMainWorkflow {
  ...
}
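
Once imported, the subworkflow is called through its alias. The sketch below assumes, purely for illustration, that Alignment.wdl defines a workflow named Alignment with inputs reads and refFa and an output outputBam; none of these names come from the original example.

```wdl
version 1.0

import "SubWorkflow/Alignment.wdl" as WfAlign

workflow MyMainWorkflow {
  input {
    Array[File] reads
    File refFa
  }

  # Hypothetical subworkflow name and inputs, referenced via the import alias.
  call WfAlign.Alignment {
    input:
      reads = reads,
      refFa = refFa
  }

  output {
    # Hypothetical output name exposed by the subworkflow.
    File alignedBam = Alignment.outputBam
  }
}
```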