Operator types#

SeqsLab currently provides two main operator categories: localization and delocalization. Each category is further divided into the following operator types:

Localization    Delocalization
Loader          Collector
Transformer     Writer
Formatter
Executor

Localization#

Localization loads datasets from a source (such as blob storage) and optionally transforms the dataset to meet the requirements of distributed task commands.

Computation passes DataFrame partitions to the task command as inputs and executes the task command (such as a shell script or SQL).

Loader#

Loaders are responsible for loading a dataset into an in-memory DataFrame or for copying a dataset from a specific data source, such as blob storage, to the local host file system.

Loaders also tell SeqsLab how to process the data, because SeqsLab supports multiple data processing options that manage and optimize workloads. For example, the CopyToLocalLoader operator can copy, or localize, genome reference files to all available computing nodes.
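The idea of localizing a dataset can be illustrated with a minimal sketch in plain Python. It is not the SeqsLab implementation: the helper name `copy_to_local`, the directory layout, and the sample file are all assumptions made for illustration, and a local directory stands in for remote blob storage.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_local(source: Path, local_dir: Path) -> Path:
    """Copy a file or directory into local_dir and return the local path.

    Hypothetical helper: it only mimics the idea of localizing a dataset
    and is not part of the SeqsLab API.
    """
    local_dir.mkdir(parents=True, exist_ok=True)
    target = local_dir / source.name
    if source.is_dir():
        # dirs_exist_ok lets a repeated localization reuse the same target.
        shutil.copytree(source, target, dirs_exist_ok=True)
    else:
        shutil.copy2(source, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for a genome reference file held in remote storage.
        remote = Path(tmp) / "remote" / "ref.fa"
        remote.parent.mkdir()
        remote.write_text(">chr20\nACGT\n")
        local = copy_to_local(remote, Path(tmp) / "node0")
        print(local)
```

In a real cluster the same copy would have to happen on every computing node; the sketch covers a single node only.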

Operator             Description
RefLoader            Automatic workload pipeline for localizing either a file or a directory shared within the cluster in a single-node cluster
CopyToLocalLoader

RefLoader#

Data types
Configuration
Sample use case
Performance metrics

CopyToLocalLoader#

Data types
Configuration
Sample use case
Performance metrics

Transformer#

Transformers are responsible for repartitioning and sorting DataFrames to optimize downstream data processing. For example, in genome sequencing analysis, a transformer can repartition BAM or VCF datasets based on non-overlapping target regions.
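As a rough illustration of region-based repartitioning, the sketch below assigns each record to the index of the non-overlapping region that contains it and sorts the records within each partition. The `Region` and `Record` tuples, the half-open interval convention, and the function names are assumptions for this example; the partitioner operators listed here may work differently.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Region = Tuple[str, int, int]  # (contig, start, end), half-open [start, end)
Record = Tuple[str, int]       # (contig, position)


def region_index(record: Record, regions: List[Region]) -> Optional[int]:
    """Return the index of the region containing record, or None."""
    contig, pos = record
    for i, (r_contig, start, end) in enumerate(regions):
        if contig == r_contig and start <= pos < end:
            return i
    return None


def partition_by_region(
    records: List[Record], regions: List[Region]
) -> Dict[Optional[int], List[Record]]:
    """Group records by region index and sort each group by position."""
    partitions: Dict[Optional[int], List[Record]] = defaultdict(list)
    for record in records:
        partitions[region_index(record, regions)].append(record)
    return {key: sorted(group) for key, group in partitions.items()}


if __name__ == "__main__":
    regions = [("chr20", 0, 1000), ("chr20", 1000, 2000), ("chr21", 0, 1000)]
    records = [("chr20", 1500), ("chr20", 10), ("chr21", 5), ("chr20", 999)]
    for index, group in partition_by_region(records, regions).items():
        print(index, group)
```

Records that fall outside every region are grouped under the key `None`, which loosely mirrors the idea of keeping unmapped reads in a separate partition.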

Operator

Description

FastqPartitioner

BamPartitionerPart1

BamPartitionerPart1Unmap

BamPartitionerHg19Part23

BamPartitionerHg19Chr20Part45

BamPartitionerHg19Part155

BamPartitionerHg19Part3109

BamPartitionerHg19Part155Consensus

BamPartitionerGRCh38Part23

BamPartitionerGRCh38Part50Consensus

BamPartitionerGRCh38Part50

BamPartitionerGRCh38Part3101

VcfPartitionerHg19Part1

VcfPartitionerHg19Part23

VcfPartitionerHg19Part155

VcfPartitionerHg19Part3109

VcfPartitionerHg19Part3109Unpadded

VcfPartitionerGRCh38Part1

VcfPartitionerGRCh38Part23

VcfPartitionerGRCh38Part155

VcfPartitionerGRCh38Part3101

VcfDataFrameTransformer

VcfGlowTransformer

FastqPartitioner#

Data types
Configuration
Sample use case
Performance metrics

BamPartitionerPart1#

Data types
Configuration
Sample use case
Performance metrics

BamPartitionerPart1Unmap#

Data types
Configuration
Sample use case
Performance metrics

BamPartitionerHg19Part23#

Data types
Configuration
Sample use case
Performance metrics

BamPartitionerHg19Chr20Part45#

Data types
Configuration
Sample use case
Performance metrics

BamPartitionerHg19Part155#

Data types
Configuration
Sample use case
Performance metrics

BamPartitionerHg19Part3109#

Data types
Configuration
Sample use case
Performance metrics

BamPartitionerHg19Part155Consensus#

Data types
Configuration
Sample use case
Performance metrics

BamPartitionerGRCh38Part23#

Data types
Configuration
Sample use case
Performance metrics

BamPartitionerGRCh38Part50Consensus#

Data types
Configuration
Sample use case
Performance metrics

BamPartitionerGRCh38Part50#

Data types
Configuration
Sample use case
Performance metrics

BamPartitionerGRCh38Part3101#

Data types
Configuration
Sample use case
Performance metrics

VcfPartitionerHg19Part1#

Data types
Configuration
Sample use case
Performance metrics

VcfPartitionerHg19Part23#

Data types
Configuration
Sample use case
Performance metrics

VcfPartitionerHg19Part155#

Data types
Configuration
Sample use case
Performance metrics

VcfPartitionerHg19Part3109#

Data types
Configuration
Sample use case
Performance metrics

VcfPartitionerHg19Part3109Unpadded#

Data types
Configuration
Sample use case
Performance metrics

VcfPartitionerGRCh38Part1#

Data types
Configuration
Sample use case
Performance metrics

VcfPartitionerGRCh38Part23#

Data types
Configuration
Sample use case
Performance metrics

VcfPartitionerGRCh38Part155#

Data types
Configuration
Sample use case
Performance metrics

VcfPartitionerGRCh38Part3101#

Data types
Configuration
Sample use case
Performance metrics

VcfDataFrameTransformer#

Data types
Configuration
Sample use case
Performance metrics

VcfGlowTransformer#

Data types
Configuration
Sample use case
Performance metrics

Formatter#

Formatters are responsible for formatting input datasets by converting the schema, adding or deleting columns, or encoding domain-specific objects.
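The three formatting steps named above (converting the schema, adding columns, deleting columns) can be sketched on plain Python dictionaries. The function `format_rows`, its parameters, and the sample variant row are hypothetical and only stand in for operations that a formatter would apply to a DataFrame.

```python
from typing import Any, Callable, Dict, Iterable, List, Set

Row = Dict[str, Any]


def format_rows(
    rows: Iterable[Row],
    casts: Dict[str, Callable[[Any], Any]],
    add: Dict[str, Callable[[Row], Any]],
    drop: Set[str],
) -> List[Row]:
    """Drop columns, cast column types, then add derived columns."""
    formatted = []
    for row in rows:
        # Delete unwanted columns first.
        out = {key: value for key, value in row.items() if key not in drop}
        # Convert the schema by casting selected columns.
        for column, cast in casts.items():
            if column in out:
                out[column] = cast(out[column])
        # Add new columns computed from the already formatted row.
        for column, compute in add.items():
            out[column] = compute(out)
        formatted.append(out)
    return formatted


if __name__ == "__main__":
    rows = [{"chrom": "chr20", "pos": "100", "ref": "A", "alt": "T", "tmp": "x"}]
    result = format_rows(
        rows,
        casts={"pos": int},
        add={"is_snv": lambda r: len(r["ref"]) == 1 and len(r["alt"]) == 1},
        drop={"tmp"},
    )
    print(result)
```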

Executor#

Executors are responsible for preprocessing or localizing an input DataFrame as a managed table for a Spark SQL command, or for saving data to local files for shell script execution. An executor is required for each input DataFrame of workflow tasks before a pipeline can execute task commands.
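The file-based path described above, saving input data to a local file so that a command can read it, might look roughly like the following sketch. The helper `run_on_partition` and the `COUNT_LINES` command are assumptions made for illustration; a small Python one-liner stands in for the shell script so that the example stays portable.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def run_on_partition(lines: List[str], command: List[str]) -> str:
    """Write one partition to a local file, run command on it, return stdout.

    Hypothetical helper, not part of the SeqsLab API. The path of the
    local file is appended to command as its last argument.
    """
    with tempfile.TemporaryDirectory() as workdir:
        local_input = Path(workdir) / "partition.txt"
        local_input.write_text("".join(line + "\n" for line in lines))
        result = subprocess.run(
            command + [str(local_input)],
            capture_output=True,
            text=True,
            check=True,  # raise if the command exits with a non-zero status
        )
    return result.stdout


# Stand-in for a task command: counts the lines of the file it is given.
COUNT_LINES = [
    sys.executable,
    "-c",
    "import sys; print(sum(1 for _ in open(sys.argv[1])))",
]

if __name__ == "__main__":
    print(run_on_partition(["read1", "read2", "read3"], COUNT_LINES))
```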

Operator

Description

BamExecutor

CsvExecutor

FastqExecutor

VcfExecutor

TableLocalizationExecutor

BamExecutor#

Data types

bam

Configuration
Sample use case
Performance metrics

CsvExecutor#

Data types

csv

Configuration
Sample use case
Performance metrics

FastqExecutor#

Data types

fq.gz, fq.bgz, fastq.gz, fastq.bgz

Configuration
Sample use case
Performance metrics

VcfExecutor#

Data types

vcf.gz, vcf.bgz

Configuration
Sample use case
Performance metrics

TableLocalizationExecutor#

Data types

delta

Configuration
Sample use case
Performance metrics

Delocalization#

Delocalization collects a file or dataset produced by a task command and saves it to a destination (such as blob storage).

Collector#

Collectors are responsible for retrieving the outputs of an executed command from the local file system and loading them back into a DataFrame. They can also compute aggregates of command outputs. A collector is required for each output file of a workflow task after the command execution has successfully completed.
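A minimal sketch of the two collector duties, gathering command outputs from the local file system and aggregating them, is shown below. The function names, the `*.txt` pattern, and the `(file name, line)` row layout are assumptions for this example, and a list of tuples stands in for the DataFrame.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (output file name, line content)


def collect_outputs(output_dir: Path, pattern: str = "*.txt") -> List[Row]:
    """Read every matching output file into (file name, line) rows."""
    rows: List[Row] = []
    for path in sorted(output_dir.glob(pattern)):
        for line in path.read_text().splitlines():
            rows.append((path.name, line))
    return rows


def count_per_file(rows: List[Row]) -> Dict[str, int]:
    """Aggregate the collected rows: number of lines per output file."""
    counts: Dict[str, int] = {}
    for name, _ in rows:
        counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp)
        (out_dir / "part-0.txt").write_text("a\nb\n")
        (out_dir / "part-1.txt").write_text("c\n")
        rows = collect_outputs(out_dir)
        print(rows)
        print(count_per_file(rows))
```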

Operator

Description

BamCollector

TexCollector

BamCollector#

Data types
Configuration
Sample use case
Performance metrics

TexCollector#

Data types
Configuration
Sample use case
Performance metrics

Writer#

Writers are responsible for delocalizing or saving the command execution output DataFrame to the specified storage or repository, such as a cloud file system, HTTPS repository, or JDBC database.
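To make the idea of writing to different kinds of destinations concrete, the sketch below dispatches on the URI scheme of the destination. Everything in it is an assumption for illustration: the function `write_rows`, the `output` table, the two supported schemes, and the use of SQLite as a local stand-in for a JDBC database. It assumes POSIX-style absolute paths.

```python
import csv
import sqlite3
from typing import List, Tuple
from urllib.parse import urlparse

Row = Tuple[str, int]


def write_rows(rows: List[Row], destination: str) -> None:
    """Save rows to a CSV file (file://) or a SQLite table (sqlite://).

    Hypothetical helper, not part of the SeqsLab API.
    """
    parsed = urlparse(destination)
    if parsed.scheme == "file":
        with open(parsed.path, "w", newline="") as handle:
            csv.writer(handle).writerows(rows)
    elif parsed.scheme == "sqlite":
        conn = sqlite3.connect(parsed.path)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS output (name TEXT, value INTEGER)"
                )
                conn.executemany("INSERT INTO output VALUES (?, ?)", rows)
        finally:
            conn.close()
    else:
        raise ValueError(f"unsupported destination scheme: {parsed.scheme!r}")
```

A call such as `write_rows([("a", 1)], "file:///tmp/out.csv")` would write a one-line CSV file; any other scheme raises `ValueError` rather than silently dropping the data.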

Operator

Description

GeneralWriter

GeneralWriter#

Data types
Configuration
Sample use case
Performance metrics