Operator types#
SeqsLab currently provides two main operator categories: localization and delocalization. Each category can be further grouped into the following operator types:

Localization | Delocalization
---|---
Loader | Collector
Transformer | Writer
Formatter |
Executor |
Localization#
Localization loads datasets from a source (such as blob storage) and optionally transforms them to meet the requirements of distributed task commands.
Computation then passes DataFrame partitions to the task command as inputs and executes the command (such as a shell script or a SQL statement).
Loader#
Loaders are responsible for loading a dataset into an in-memory DataFrame or for copying a dataset from a specific data source, such as blob storage, to the local host file system.
Loaders also tell SeqsLab how to process the data, since SeqsLab supports multiple data processing options that manage and optimize workloads. For example, the CopyToLocal loader operator can copy, or localize, genome reference files to all available computing nodes.
Operator | Description
---|---
 | Automatic workload pipeline for localizing either a file or a directory shared within the cluster in a single-node cluster
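The snippet below is a conceptual PySpark sketch of what a loader does, not SeqsLab's implementation: it loads a dataset into an in-memory DataFrame and localizes a reference file so that every computing node can read its own local copy, which is the idea behind a CopyToLocal-style operator. All paths, file contents, and names are placeholders.

```python
import pathlib

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loader-sketch").getOrCreate()

# Local stand-ins for datasets that would normally live in blob storage.
pathlib.Path("/tmp/variants.csv").write_text("chrom,pos\nchr1,10143\nchr2,88710\n")
pathlib.Path("/tmp/reference.fa").write_text(">chr1\nACGTACGT\n")

# Load a dataset into an in-memory DataFrame.
variants = spark.read.csv("file:///tmp/variants.csv", header=True)

# Localize the reference file so every computing node can read its own copy
# from the local file system (conceptually what a CopyToLocal-style loader does).
spark.sparkContext.addFile("file:///tmp/reference.fa")

def use_local_reference(rows):
    local_ref = SparkFiles.get("reference.fa")  # resolved per worker
    with open(local_ref) as fh:
        header = fh.readline().strip()
    yield (header, sum(1 for _ in rows))

print(variants.rdd.mapPartitions(use_local_reference).collect())
```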
Transformer#
Transformers are responsible for repartitioning and sorting DataFrames to optimize downstream data processing. For example, in genome sequencing analysis, a transformer can repartition BAM or VCF datasets based on non-overlapping target regions.
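A minimal PySpark sketch of the repartitioning idea, assuming hypothetical `chrom` and `pos` columns and a fixed 1 Mb region size; SeqsLab's transformer operators may derive target regions differently.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transformer-sketch").getOrCreate()

# Hypothetical variant records with chromosome and position columns.
vcf_like = spark.createDataFrame(
    [("chr1", 10143, "A", "G"), ("chr1", 2500431, "T", "C"), ("chr2", 88710, "G", "A")],
    ["chrom", "pos", "ref", "alt"],
)

# Assign each record to a non-overlapping 1 Mb target region, repartition so
# each downstream task command sees one region's records, and sort records by
# position within each partition.
region_size = 1000000
partitioned = (
    vcf_like
    .withColumn("region_start", (F.floor(F.col("pos") / region_size) * region_size).cast("long"))
    .repartition("chrom", "region_start")
    .sortWithinPartitions("chrom", "pos")
)

partitioned.show()
```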
Formatter#
Formatters are responsible for formatting input datasets by converting the schema, adding or deleting columns, or encoding domain-specific objects.
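As an illustration only (not a SeqsLab operator), the following PySpark sketch shows the kinds of changes a formatter makes: converting the schema by casting a column, adding a derived column, and dropping a column the task command does not need. The column names and values are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("formatter-sketch").getOrCreate()

# Hypothetical raw input whose schema does not yet match the task command.
raw = spark.createDataFrame(
    [("chr1", "10143", "A", "G", "PASS")],
    ["chrom", "pos", "ref", "alt", "filter"],
)

formatted = (
    raw
    # Convert the schema: pos arrives as a string, the command expects a long.
    .withColumn("pos", F.col("pos").cast("long"))
    # Add a derived column encoding a domain-specific identifier.
    .withColumn(
        "variant_id",
        F.concat_ws("-", "chrom", F.col("pos").cast("string"), "ref", "alt"),
    )
    # Delete a column the command does not use.
    .drop("filter")
)

formatted.printSchema()
```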
Executor#
Executors are responsible for preprocessing or localizing an input DataFrame as a managed table for a Spark SQL command, or for saving data to local files for shell script execution. An executor is required for each input DataFrame of a workflow task before a pipeline can execute task commands.
Operator | Description
---|---
FastqExecutor | Executor for FASTQ input datasets with `fq.gz`, `fq.bgz`, `fastq.gz`, or `fastq.bgz` file extensions
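The following PySpark sketch illustrates both executor roles under simplified assumptions; it is not the FastqExecutor implementation. It exposes an input DataFrame as a SQL-addressable table (here a temporary view, whereas SeqsLab uses managed tables) and writes the input to a local file for a shell command, with `wc` standing in for a real bioinformatics tool.

```python
import subprocess

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executor-sketch").getOrCreate()

# Hypothetical sequencing reads used as a task input.
reads = spark.createDataFrame(
    [("read1", "ACGT"), ("read2", "GGCA")],
    ["name", "seq"],
)

# Spark SQL path: expose the input DataFrame as a table so a SQL task command
# can refer to it by name.
reads.createOrReplaceTempView("task_input")
spark.sql("SELECT count(*) AS n_reads FROM task_input").show()

# Shell script path: materialize the input as a local FASTQ-like file and hand
# it to a command-line tool.
local_path = "/tmp/task_input.fastq"
with open(local_path, "w") as fh:
    for row in reads.toLocalIterator():
        fh.write(f"@{row['name']}\n{row['seq']}\n+\n{'I' * len(row['seq'])}\n")

subprocess.run(["wc", "-l", local_path], check=True)
```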
Delocalization#
Delocalization collects a file or dataset output by a task command and saves it to a destination (such as blob storage).
Collector#
Collectors are responsible for retrieving the outputs of an executed command from the local file system and returning them as a DataFrame. Collectors can also compute aggregates of command outputs. A collector is required for each output file of a workflow task after the command execution has successfully completed.
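A hedged PySpark sketch of the collector idea: read command outputs from the local file system back into a DataFrame and compute an aggregate over them. The output directory, file layout, and column names are assumptions made for this example.

```python
import pathlib

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("collector-sketch").getOrCreate()

# Pretend the task command wrote per-sample metrics as TSV files on the local
# file system.
out_dir = pathlib.Path("/tmp/task_outputs")
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "part-0.tsv").write_text("sample_id\tcoverage\nS1\t31.4\nS2\t28.9\n")

# Retrieve the command outputs back into a DataFrame ...
metrics = spark.read.csv(
    f"file://{out_dir}/*.tsv", sep="\t", header=True, inferSchema=True
)

# ... and optionally compute aggregates over them.
summary = metrics.groupBy("sample_id").agg(F.avg("coverage").alias("mean_coverage"))
summary.show()
```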
Writer#
Writers are responsible for delocalizing or saving the command execution output DataFrame to the specified storage or repository, such as a cloud file system, HTTPS repository, or JDBC database.
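A conceptual PySpark sketch of writer behavior, not SeqsLab's writer API: the output DataFrame is saved to a file-system destination and, alternatively, to a JDBC database. The URIs, table name, and credentials are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("writer-sketch").getOrCreate()

# Hypothetical task output produced by a collector.
results = spark.createDataFrame(
    [("sample-01", 31.4), ("sample-02", 28.9)],
    ["sample_id", "mean_coverage"],
)

# Delocalize the output DataFrame to a file-system destination
# (swap the URI for a cloud path such as abfss:// or s3a:// in practice).
results.write.mode("overwrite").parquet("file:///tmp/coverage_summary.parquet")

# Or save it to a JDBC database; the URL, table, and credentials below are
# placeholders and require a reachable database plus the matching JDBC driver.
results.write.mode("append").jdbc(
    url="jdbc:postgresql://db.example.com:5432/metrics",
    table="coverage_summary",
    properties={"user": "analyst", "password": "change-me", "driver": "org.postgresql.Driver"},
)
```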