(concepts:wdl-best-practice)=
# WDL best practice guidelines

When writing your WDL workflow scripts, we require that you follow these guidelines and best practices to enable scalable, workload-driven process automation and production on the SeqsLab platform. Following them also reduces unnecessary complexity in the workflow design and helps ensure the integrity of workflow execution.

## Input and output files
Use **File**-typed input and output variables. Input and output files are the central construct for orchestrating high-performance, data-parallel analysis workflows on the underlying computing and storage infrastructure. Declare every variable that refers to an input or output file or directory with the **File** type. This allows SeqsLab to automatically identify the workflow and task variables that link datasets between the local filesystem and cloud storage, and to apply dynamic data-parallel processing automation.
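
As a minimal sketch of this guideline, the task below declares both its input and its output as `File`. The task name `SortBam`, the variable names, and the default output name are hypothetical and chosen only for illustration. The sketch also assumes that `samtools` is available in the task's runtime environment.

```shell
version 1.0

# Hypothetical task: the names are illustrative, not part of SeqsLab.
task SortBam {
  input {
    File inputBam                      # File-typed input
    String outputName = "sorted.bam"   # plain string, not a file
  }

  command <<<
    set -e -o pipefail
    samtools sort -o "~{outputName}" "~{inputBam}"
  >>>

  output {
    File sortedBam = "~{outputName}"   # File-typed output
  }
}
```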

## Compound types with File-typed variables
Use the **Array** compound type to represent a list of input files or nested input files, e.g., **Array\[File]** or **Array\[Array\[File]]**. The SeqsLab workflow execution service does not currently support File-typed variables declared in other compound types, such as _Map_, _Pair_, _Object_, and _Struct_.
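
The declarations below sketch what this looks like in a workflow `input` block. The workflow name and variable names are hypothetical, and the commented-out lines only illustrate the kinds of declaration that are not currently supported.

```shell
version 1.0

# Hypothetical workflow showing only the input declarations.
workflow CompoundFileInputs {
  input {
    # One file per sample.
    Array[File] inputBams
    # One inner array per sample, e.g. [[R1, R2], [R1, R2]].
    Array[Array[File]] inputFastqs
    # File-typed variables inside other compound types are not
    # currently supported, for example:
    # Map[String, File] bamBySample
    # Pair[File, File] readPair
  }
}
```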

## Command standard output and error
SeqsLab automatically logs everything that a command writes to `stdout` and `stderr`, and stores the logs in the given storage infrastructure (e.g., your Azure Blob Storage account) for audit trail purposes. As such, there is no need to explicitly define them as command outputs. To preserve workflow integrity and reproducibility, we do not recommend relying on the command standard output or error messages in your WDL workflow execution.
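
One way to follow this recommendation, sketched under the assumption that the result is needed downstream, is to redirect it to a named file and declare that file as an output. The task name `CountReads` and the file name `read_count.txt` are hypothetical, and `samtools` is assumed to be available in the runtime environment.

```shell
version 1.0

# Hypothetical task: writes its result to a file instead of stdout.
task CountReads {
  input {
    File inputBam
  }

  command <<<
    set -e -o pipefail
    # Redirect the result to a file instead of leaving it on stdout.
    samtools view -c "~{inputBam}" > read_count.txt
  >>>

  output {
    # Declared as a file, rather than parsed with read_int(stdout()).
    File readCount = "read_count.txt"
  }
}
```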

## Nested scatter calls
Because SeqsLab provides implicit and dynamic data parallel processing (file data partitioning), you do not need to parallelize your workloads in the WDL script itself. Designing `scatter` calls for manual data partitioning becomes unnecessary, which improves readability and performance. We recommend that you use a `scatter` block only for running the same sub-workflow or task on multiple input files (e.g., FASTQ or BAM files) in parallel, and that you avoid nested scatter blocks (i.e., a scatter block inside another scatter block) in the same workflow.

The following example shows the pattern to avoid: a scatter block for partitioning a BAM file is nested in a scatter block for parallelizing the processing of multiple FASTQ input files.
```shell
scatter (idx in range(length(inputFastqs))) {
  Array[File] reads = inputFastqs[idx]
  
  call BwaMem {
    input:
      readR1 = reads[0],
      readR2 = reads[1],
      refFa = refFa
  }

  call PartitionBam {
    input:
      partitionBed = WgsCallableRegions,
      inputBam = BwaMem.outputBam
  }
  
  scatter (inputBam in PartitionBam.outputBam) {
    call VariantCalling {
      input:
        inputBam = inputBam,
        refFa = refFa
    }
  }
  
  call MergeVcfs {
    input:
      inputVcfs = VariantCalling.outputVcf
  }
}
```
With SeqsLab [implicit and dynamic data parallel processing](jobs:workflow-engine), you can simplify your WDL script, because you no longer need to design and write scatter-gather pipelines by hand to improve the overall workflow performance. The following example shows the simplified version of the same workflow.
```shell
scatter (idx in range(length(inputFastqs))) {
  Array[File] reads = inputFastqs[idx]
  
  call BwaMem {
    input:
      readR1 = reads[0],
      readR2 = reads[1],
      refFa = refFa
  }
  
  call VariantCalling {
    input:
      inputBam = BwaMem.outputBam,
      refFa = refFa
  }
}
```

## Output files
It is good practice to explicitly define which files to output in the `output` section of each WDL task. SeqsLab automatically collects and saves those output files in the underlying storage infrastructure. If you need to output an entire directory, the directory is archived into a single archive file for output, as in the following example. Note, however, that outputting whole directories is not recommended.
```shell
command <<<
  myprogram --outdir "~{outputDir}" "~{inputBam}"
  tar -zcf "~{outputDir}.tgz" "~{outputDir}"
>>>

output {
  File outputArchive = "~{outputDir}.tgz"
}
```
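
For comparison, the sketch below shows the recommended approach of declaring each output file explicitly. It reuses the placeholder `myprogram` from the example above. The task name and the file names `calls.vcf.gz` and `calls.vcf.gz.tbi` are hypothetical and only stand in for whatever files the program actually writes into its output directory.

```shell
version 1.0

# Hypothetical task: declares individual output files, not a directory.
task RunMyProgram {
  input {
    File inputBam
    String outputDir = "results"
  }

  command <<<
    set -e -o pipefail
    myprogram --outdir "~{outputDir}" "~{inputBam}"
  >>>

  output {
    # Hypothetical file names assumed to be written by myprogram.
    File outputVcf = "~{outputDir}/calls.vcf.gz"
    File outputVcfIndex = "~{outputDir}/calls.vcf.gz.tbi"
  }
}
```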

## Command section style
Use triple angle brackets **<<< ... >>>** to enclose the `command` section, and write expression placeholders as **~{...}** to avoid confusion with Bash `${...}` expressions. For example, a command might reference an input to the task in this way:
```shell
command <<<
  set -e -o pipefail
  samtools index "~{processedBam}"
>>>
```
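
The benefit is easiest to see when a command mixes both syntaxes. In the sketch below, which uses a hypothetical task name and assumes `samtools` and `nproc` are available in the runtime environment, WDL substitutes only the `~{...}` placeholder and leaves `${...}` for Bash to expand at run time.

```shell
version 1.0

# Hypothetical task mixing a WDL placeholder with a Bash variable.
task IndexBam {
  input {
    File processedBam
  }

  command <<<
    set -e -o pipefail
    # Bash variable: left untouched by WDL, expanded by Bash at run time.
    threads=$(nproc)
    # ~{processedBam} is replaced by WDL before the script runs.
    samtools index -@ "${threads}" "~{processedBam}"
  >>>

  output {
    # With the default settings, samtools writes the index next to the BAM.
    File bamIndex = "~{processedBam}.bai"
  }
}
```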

## Import WDL scripts
The SeqsLab platform provides a GA4GH (Global Alliance for Genomics and Health) [Tool Registry Service API](api:trs) to manage your workflow assets (e.g., WDL scripts, input JSON files, and container images) for the production requirements of regulatory compliance, code integrity, and version control. When importing other WDL files that are registered under the same _Tool Version_, always use **relative paths**, that is, paths relative to the location of the main WDL file. In the following example, we import `Alignment.wdl` from the directory `SubWorkflow`, which is located in the same directory as the main workflow `MyMainWorkflow.wdl`.
```text
/root/MyMainWorkflow.wdl
/root/SubWorkflow/Alignment.wdl
```
```shell
version 1.0
import "SubWorkflow/Alignment.wdl" as WfAlign

workflow MyMainWorkflow {
  ...
}
```
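
Once imported, the sub-workflow is called through its alias. The sketch below is illustrative only: it assumes that `Alignment.wdl` defines a workflow named `Alignment` with inputs `readR1`, `readR2`, and `refFa` and an output `outputBam`. None of these names are given in this guide.

```shell
version 1.0
import "SubWorkflow/Alignment.wdl" as WfAlign

workflow MyMainWorkflow {
  input {
    File readR1
    File readR2
    File refFa
  }

  # Assumes Alignment.wdl defines `workflow Alignment` with these inputs.
  call WfAlign.Alignment {
    input:
      readR1 = readR1,
      readR2 = readR2,
      refFa = refFa
  }

  output {
    # Assumes the sub-workflow exposes an output named outputBam.
    File alignedBam = Alignment.outputBam
  }
}
```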
