On-demand Spark cluster

On the SeqsLab platform, all bioinformatics pipelines are standardized with WDL. To empower users to apply parallel computing technology, each WDL task is submitted as a Spark task to an Apache Spark (external link) cluster, which is provisioned from cloud resource providers at the beginning of a main workflow, sub-workflow, or task, depending on your needs. For more information on reusing clusters, see Cluster reuse.

Executing each WDL task as a Spark task provides the following benefits:

Parallel data localization

When localizing data to the provisioned VMs, the Spark framework can parallelize the download process because the WDL inputs are stored in an Azure storage account such as Azure Data Lake Storage Gen2. A minimal sketch is shown below.
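As an illustration only, a PySpark sketch of parallel localization might look like the following. The storage account, container, and run paths are hypothetical placeholders, and SeqsLab's actual localization logic is internal to the platform.

```python
from pyspark.sql import SparkSession

# On the SeqsLab platform the cluster and session are provisioned for you;
# this standalone session exists only to make the sketch runnable.
spark = SparkSession.builder.appName("localize-inputs").getOrCreate()
sc = spark.sparkContext

# binaryFiles distributes (path, bytes) pairs across executors, so the
# files under the input prefix are downloaded in parallel. The abfss://
# URI is a hypothetical placeholder.
inputs = sc.binaryFiles("abfss://inputs@myaccount.dfs.core.windows.net/run-001/")

def localize(pair):
    """Write one input file to executor-local disk and return its path."""
    import os
    path, data = pair
    local_path = os.path.join("/tmp", os.path.basename(path))
    with open(local_path, "wb") as f:
        f.write(bytes(data))
    return local_path

# Each executor localizes its own partition of files concurrently.
local_paths = inputs.map(localize).collect()
print(local_paths)
```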

Parallel pipe commands

Because each WDL task defines a command section, the Spark framework can use RDD pipe (external link) to execute the defined command in parallel when the input data is split into multiple partitions, as in the sketch below.
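The following sketch shows how RDD pipe fans a command out over partitions. The input path is a hypothetical placeholder, and the tr command stands in for the command section of a real WDL task.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipe-command").getOrCreate()
sc = spark.sparkContext

# Split the input into four partitions; RDD.pipe() launches one external
# process per partition, so the command runs four times in parallel.
lines = sc.textFile(
    "abfss://inputs@myaccount.dfs.core.windows.net/run-001/reads.txt",
    minPartitions=4,
)

# Each partition's lines are fed to the command's stdin, and the
# command's stdout lines become the resulting RDD.
result = lines.pipe("tr a-z A-Z")
print(result.take(5))
```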

Parallel data delocalization

Once the command process finishes, output data is written back to the Azure Data Lake Storage Gen2 account, and parallel writing can also be enabled through Spark, as sketched below.
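As a minimal sketch of parallel delocalization, with the output URI as a hypothetical placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delocalize-outputs").getOrCreate()
sc = spark.sparkContext

# Suppose "outputs" holds the per-partition results of a piped command.
outputs = sc.parallelize(["chr1\t100", "chr2\t200"], numSlices=2)

# saveAsTextFile writes one part file per partition, so the executors
# upload all partitions to the Gen2 container concurrently.
outputs.saveAsTextFile("abfss://outputs@myaccount.dfs.core.windows.net/run-001/result/")
```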

Parallel data transformation

Data chunking mechanisms for different types of data can also benefit from the Spark framework. For example, the SeqsLab platform uses the third-party library ADAM (external link), which is built on Apache Spark.
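As a rough sketch of a distributed transformation through ADAM's Python API: the module layout can vary between ADAM versions, and the BAM path is a hypothetical placeholder.

```python
from pyspark.sql import SparkSession
from bdgenomics.adam.adamContext import ADAMContext  # ADAM's Python API

spark = SparkSession.builder.appName("adam-transform").getOrCreate()
ac = ADAMContext(spark)

# Loading a BAM through ADAM partitions the reads across the cluster,
# so downstream transformations operate on each chunk in parallel.
alignments = ac.loadAlignments(
    "abfss://inputs@myaccount.dfs.core.windows.net/run-001/sample.bam"
)

# Convert to a Spark DataFrame and run a simple distributed action.
df = alignments.toDF()
print(df.count())
```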