WDL best practice guidelines

When writing your WDL workflow scripts, we require that you follow the guidelines and best practices below. They enable scalable, workload-driven process automation and production on the SeqsLab platform, reduce unnecessary complexity in the workflow design, and help ensure the integrity of workflow execution.

Input and output files

Use File-typed input and output variables. Input and output files are the central construct for orchestrating high-performance, data-parallel analysis workflows on the underlying computing and storage infrastructure. Declare every variable that refers to an input or output file or directory with the File type. This allows SeqsLab to automatically identify the workflow and task variables that link datasets between the local filesystem and cloud storage, and to apply dynamic data-parallel processing automation.
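
As a minimal sketch of this guideline, the task below declares both its input and its output with the File type. The task name SortBam, the variable names, and the default output name are assumptions made for illustration; they are not taken from SeqsLab documentation.

```wdl
version 1.0

# Hypothetical task: every file that enters or leaves it is File-typed.
task SortBam {
  input {
    File inputBam                      # File, not String, so the file can be linked and localized
    String outputName = "sorted.bam"   # plain String: this is only a file name, not a dataset
  }

  command <<<
    set -e -o pipefail
    samtools sort -o "~{outputName}" "~{inputBam}"
  >>>

  output {
    File sortedBam = outputName        # the String file name is coerced to a File output
  }
}
```

Passing the BAM path as a String instead would hide the file from the type system, so it could not be recognized as a dataset to link between local and cloud storage.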

Compound types with File-typed variables

Use the Array compound type to represent a list of input files or nested input files, e.g., Array[File] or Array[Array[File]]. The SeqsLab workflow execution service does not currently support File-typed variables declared in other compound types, such as Map, Pair, Object, and Struct.
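
The following sketch shows workflow inputs that stay within the supported Array forms. The workflow and variable names are hypothetical, and the parallel-array layout used in place of a Map is only one possible workaround, not a pattern prescribed by SeqsLab.

```wdl
version 1.0

# Hypothetical workflow showing only its input declarations.
workflow PairedEndInputs {
  input {
    File refFa

    # One inner array per sample, e.g. [R1.fastq.gz, R2.fastq.gz]
    Array[Array[File]] inputFastqs

    # Instead of Map[String, File] (File inside a Map is not supported),
    # keep sample names in a parallel array indexed like inputFastqs.
    Array[String] sampleNames
  }
}
```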

Command standard output and error

SeqsLab automatically logs the stdout and stderr of all commands and stores them in the given storage infrastructure (e.g., your Azure Blob Storage account) for audit trail purposes. As such, there is no need to declare them explicitly as task outputs. For workflow integrity and reproducibility, we also recommend that your WDL workflow execution not rely on the content of a command's standard output or error messages.
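
One way to follow this recommendation is to have the command write its result to a named file and declare that file as the output, rather than declaring something like `File log = stdout()`. The sketch below assumes a hypothetical task named CountReads and a hypothetical result file name.

```wdl
version 1.0

# Hypothetical task: the result is a declared File, not the task's stdout stream.
task CountReads {
  input {
    File inputBam
  }

  command <<<
    set -e -o pipefail
    # Capture the read count in a named file that downstream tasks can consume.
    samtools view -c "~{inputBam}" > read_count.txt
  >>>

  output {
    File readCount = "read_count.txt"
  }
}
```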

Nested scatter calls

Because SeqsLab provides implicit and dynamic data parallel processing (file data partitioning), you do not need to parallelize workloads in your WDL scripts. Designing scatter calls and partitioning data manually becomes unnecessary, which improves both readability and performance. We recommend that you use a scatter block only to run the same sub-workflow or task on multiple input files (e.g., FASTQ or BAM files) in parallel, and that you avoid nested scatter blocks (i.e., a scatter block inside another scatter block) in the same workflow.

The following example shows the nested pattern to avoid: a scatter block that partitions a BAM file is nested inside a scatter block that parallelizes the processing of multiple FASTQ input files.

scatter (idx in range(length(inputFastqs))) {
  Array[File] reads = inputFastqs[idx]
  
  call BwaMem {
    input:
      readR1 = reads[0],
      readR2 = reads[1],
      refFa = refFa
  }

  call PartitionBam {
    input:
      partitionBed = WgsCallableRegions,
      inputBam = BwaMem.outputBam
  }
  
  scatter (inputBam in PartitionBam.outputBam) {
    call VariantCalling {
      input:
        inputBam = inputBam,
        refFa = refFa
    }
  }
  
  call MergeVcfs {
    input:
      inputVcfs = VariantCalling.outputVcf
  }
}

With SeqsLab's implicit and dynamic data parallel processing, you can simplify your WDL script and skip the extra effort of designing and writing scatter-gather pipelines to improve overall workflow performance. The following example shows the simplified version of the workflow above.

scatter (idx in range(length(inputFastqs))) {
  Array[File] reads = inputFastqs[idx]
  
  call BwaMem {
    input:
      readR1 = reads[0],
      readR2 = reads[1],
      refFa = refFa
  }
  
  call VariantCalling {
    input:
      inputBam = BwaMem.outputBam,
      refFa = refFa
  }
}

Output files

It is good practice to explicitly define which files to output in the output section of each WDL task. SeqsLab automatically collects those output files and saves them in the underlying storage infrastructure. When you want to output an entire directory, archive it into a single file and declare that archive as the output, as in the example below. Note, however, that outputting an entire directory is not recommended.

command <<<
  myprogram --outdir "~{outputDir}" "~{inputBam}"
  tar -zcf "~{outputDir}.tgz" "~{outputDir}"
>>>

output {
  File outputArchive = "~{outputDir}.tgz"
}
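
In contrast to the archive approach above, declaring individual output files might look like the following sketch. It reuses the placeholder command myprogram from the example above; the task name and the file names calls.vcf.gz and calls.vcf.gz.tbi are hypothetical and stand in for whatever files your tool actually writes.

```wdl
version 1.0

# Hypothetical task: each result file is listed explicitly in the output section.
task RunMyProgram {
  input {
    File inputBam
    String outputDir = "results"
  }

  command <<<
    set -e -o pipefail
    mkdir -p "~{outputDir}"
    myprogram --outdir "~{outputDir}" "~{inputBam}"
  >>>

  output {
    # Assumed file names, for illustration only.
    File outputVcf = "~{outputDir}/calls.vcf.gz"
    File outputVcfIndex = "~{outputDir}/calls.vcf.gz.tbi"
  }
}
```

Listing the files individually keeps each result addressable as its own File in downstream calls, which an archive does not.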

Command section style

Use triple angle brackets <<< … >>> to enclose the command section, and denote expression placeholders with ~{…} so that they cannot be confused with Bash ${…} expressions. For example, a command might reference a task input as follows:

command <<<
  set -e -o pipefail
  samtools index "~{processedBam}"
>>>
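
The benefit is clearest when a command mixes WDL placeholders with shell variables. In the sketch below, the task name IndexBam, the threads input, and the shell variable bam_name are assumptions made for illustration. Inside <<< … >>>, only ~{…} is evaluated by WDL, while ${…} and $(…) are passed through to Bash untouched.

```wdl
version 1.0

# Hypothetical task mixing WDL placeholders (~{...}) with Bash expansions (${...}).
task IndexBam {
  input {
    File processedBam
    Int threads = 4
  }

  command <<<
    set -e -o pipefail
    # $(...) and ${...} belong to Bash; ~{...} is filled in by WDL beforehand.
    bam_name=$(basename "~{processedBam}")
    # Write the index into the working directory rather than beside the input.
    samtools index -@ ~{threads} "~{processedBam}" "${bam_name}.bai"
  >>>

  output {
    File bamIndex = basename(processedBam) + ".bai"
  }
}
```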

Import WDL scripts

The SeqsLab platform provides a GA4GH (Global Alliance for Genomics and Health) Tool Registry Service API to manage your workflow assets (e.g., WDL scripts, input JSON files, and container images) and to meet production requirements for regulatory compliance, code integrity, and version control. When importing other WDL files that are registered under the same Tool Version, always use paths relative to the location of the main WDL file. In the following example, the main workflow MyMainWorkflow.wdl imports Alignment.wdl from the directory SubWorkflow, which is located in the same directory as the main workflow file.

/root/MyMainWorkflow.wdl
/root/SubWorkflow/Alignment.wdl

version 1.0
import "SubWorkflow/Alignment.wdl" as WfAlign

workflow MyMainWorkflow {
  ...
}
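
Once imported, the sub-workflow is called through its alias. The sketch below assumes that Alignment.wdl defines a workflow named Alignment with inputs readR1, readR2, and refFa and an output outputBam. None of these names is confirmed by the source, so adjust them to match your own sub-workflow.

```wdl
version 1.0
import "SubWorkflow/Alignment.wdl" as WfAlign

workflow MyMainWorkflow {
  input {
    File readR1
    File readR2
    File refFa
  }

  # Assumed sub-workflow name and input names; the alias WfAlign comes from the import.
  call WfAlign.Alignment {
    input:
      readR1 = readR1,
      readR2 = readR2,
      refFa = refFa
  }

  output {
    # Assumed output name of the sub-workflow.
    File alignedBam = Alignment.outputBam
  }
}
```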