Workflow management

In the previous section, we introduced how to standardize bioinformatics pipelines using the Workflow Description Language (WDL) (external link). However, WDL only provides a framework for describing pipelines; an engine is still needed to execute the workflow defined in a WDL file. This is where Cromwell comes in. Developed by the Broad Institute, Cromwell is designed to execute pipelines written in WDL.

Cromwell engine

When you use the Cromwell engine, you specify a set of runtime attributes for each task, assigning customized resources such as a Docker container, memory, CPU, and other backend resources to that task. Alternatively, you can configure default runtime attributes that apply to all tasks (a sketch of this approach follows the example below).

Below is an example (external link) of task-level runtime attributes from the Cromwell documentation site:

task jes_task {
  command {
    echo "Hello JES!"
  }
  runtime {
    docker: "ubuntu:latest"
    memory: "4G"
    cpu: "3"
    zones: "us-central1-c us-central1-b"
    disks: "/mnt/mnt1 3 SSD, /mnt/mnt2 500 HDD"
  }
}
workflow jes_workflow {
  call jes_task
}

As a result, the task (jes_task) is executed with the runtime attributes shown above on the default backend provider (Google Cloud Platform). One of the biggest strengths of the Cromwell engine is its ability to run different tasks on various backends. It currently supports local backends as well as cloud backends such as Amazon Web Services (AWS), Alibaba Batch Compute (BCS), and Google Cloud, among others. Microsoft's Cromwell on Azure, a derivative of Cromwell, supports running WDL files on the Azure Batch backend.
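In addition to per-task attributes such as the ones above, Cromwell also lets you supply defaults that apply to every task. One common way to do this is through the default_runtime_attributes object in a workflow options file. The minimal sketch below reuses attribute values from the example above and is illustrative only:

{
  "default_runtime_attributes": {
    "docker": "ubuntu:latest",
    "memory": "4G",
    "cpu": "2",
    "zones": "us-central1-c us-central1-b"
  }
}

Attributes set explicitly in a task's runtime block take precedence over these defaults.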

Note

Like other cloud-based Cromwell backends, Cromwell on Azure launches a virtual machine for each task based on its runtime attributes. In this regard, SeqsLab Cromwell is different from Cromwell on Azure.

SeqsLab Cromwell vs. Cromwell

SeqsLab Cromwell, also known as the SeqsLab Workflow Execution Engine (SeqsLab WE2), is likewise an extension of the Cromwell project. However, SeqsLab WE2 does not use runtime attributes in the same way as the original Cromwell, which gives it the following advantages:

  • On-demand Spark cluster

    Instead of provisioning a virtual machine for each task execution, SeqsLab WE2 requests a parallel computing cluster based on Apache Spark (external link), which enables SeqsLab users to dynamically configure the Spark properties (external link).

  • Cluster reuse

    Provisioning resources from a cloud service provider and setting up a Docker environment can take a considerable amount of time. Because SeqsLab WE2 allows runtime attributes to be configured not only at the task level but also at the workflow level, tasks that share the same attributes can reuse an already provisioned cluster instead of waiting for a new one. SeqsLab WE2 also supports dynamic runtime attribute settings for better resource utilization.

  • Implicit and dynamic data parallelization

    The SeqsLab management console provides users with an interface to easily configure implicit and dynamic data parallelization settings at the task level. By defining a task's inputs and outputs, users can parallelize its computation without complicating the WDL commands or structure (see the WDL sketch after this list).

  • Output management

    Once outputs are specified in a WDL task's output section, SeqsLab WE2 saves the data to the specified file system and registers the corresponding Data Repository Service (DRS) records.
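As a concrete illustration of the last two points, the hedged WDL sketch below declares explicit inputs and outputs for a task. Based on the behavior described above, SeqsLab WE2 would use the declared inputs when configuring implicit data parallelization and would register the files declared in the output section as DRS records. The task name, file names, and runtime values are illustrative only:

task compress_reads {
  # Declared inputs: per the description above, SeqsLab WE2 uses a task's
  # input definitions to drive its data parallelization settings.
  File reads_fastq
  String sample_name

  command {
    gzip -c ${reads_fastq} > ${sample_name}.fastq.gz
  }
  output {
    # Files declared here are saved to the configured file system and
    # registered as Data Repository Service (DRS) records.
    File compressed_reads = "${sample_name}.fastq.gz"
  }
  runtime {
    docker: "ubuntu:latest"
    memory: "4G"
    cpu: "2"
  }
}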

Backend providers

SeqsLab WE2 currently supports two main backends: users can run their WDL workflows either on the cloud or locally. In certain situations, a hybrid of both backends may also be used.

  • Cloud backend: Azure Batch

    SeqsLab WE2 uses Azure Batch (external link) as the main backend for cloud-based scenarios.

  • Local backend: Kubernetes

    For use cases that require analyses to be executed on private resources, SeqsLab provides a local backend using Kubernetes (external link).

  • Hybrid backend

    SeqsLab WE2 also enables users to submit tasks or workflows to either a cloud or a local backend based on their compliance or operational needs. The following image shows how SeqsLab WE2 communicates with both local and cloud backends in the same run.

    [Figure: hybrid-backends]