Operator types#
SeqsLab currently provides two main operator categories: localization and delocalization. Each category can be further grouped into the following operator types:

Localization | Delocalization
---|---
Loader | Collector
Transformer | Writer
Formatter |
Executor |
Localization#
Localization loads datasets from a source (such as blob storage) and optionally transforms them to meet the requirements of distributed task commands.
Computation then passes DataFrame partitions to the task command as inputs and executes the command (such as a shell script or a SQL statement).
Loader#
Loaders are responsible for loading a dataset into an in-memory DataFrame or for copying a dataset from a specific data source, such as blob storage, to the local host file system.
Loaders also tell SeqsLab how to process the data, since SeqsLab supports multiple data processing options that manage and optimize workloads. For example, the CopyToLocal loader operator can copy, or localize, genome reference files to all available computing nodes.
Operator | Description
---|---
 | Automatic workload pipeline for localizing either a file or a directory shared within the cluster in a single-node cluster
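The snippet below is a conceptual PySpark sketch of what a loader does, not SeqsLab's implementation: it loads a dataset into an in-memory DataFrame and localizes a reference file so that every computing node can read its own local copy, which is the idea behind a CopyToLocal-style operator. All paths, file contents, and names are placeholders.

```python
import pathlib

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loader-sketch").getOrCreate()

# Local stand-ins for datasets that would normally live in blob storage.
pathlib.Path("/tmp/variants.csv").write_text("chrom,pos\nchr1,10143\nchr2,88710\n")
pathlib.Path("/tmp/reference.fa").write_text(">chr1\nACGTACGT\n")

# Load a dataset into an in-memory DataFrame.
variants = spark.read.csv("file:///tmp/variants.csv", header=True)

# Localize the reference file so every computing node can read its own copy
# from the local file system (conceptually what a CopyToLocal-style loader does).
spark.sparkContext.addFile("file:///tmp/reference.fa")

def use_local_reference(rows):
    local_ref = SparkFiles.get("reference.fa")  # resolved per worker
    with open(local_ref) as fh:
        header = fh.readline().strip()
    yield (header, sum(1 for _ in rows))

print(variants.rdd.mapPartitions(use_local_reference).collect())
```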
Transformer#
Transformers are responsible for repartitioning and sorting DataFrames to optimize downstream data processing. For example, in genome sequencing analysis, a transformer can repartition BAM or VCF datasets based on non-overlapping target regions.
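A minimal PySpark sketch of the repartitioning idea, assuming hypothetical `chrom` and `pos` columns and a fixed 1 Mb region size; SeqsLab's transformer operators may derive target regions differently.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transformer-sketch").getOrCreate()

# Hypothetical variant records with chromosome and position columns.
vcf_like = spark.createDataFrame(
    [("chr1", 10143, "A", "G"), ("chr1", 2500431, "T", "C"), ("chr2", 88710, "G", "A")],
    ["chrom", "pos", "ref", "alt"],
)

# Assign each record to a non-overlapping 1 Mb target region, repartition so
# each downstream task command sees one region's records, and sort records by
# position within each partition.
region_size = 1000000
partitioned = (
    vcf_like
    .withColumn("region_start", (F.floor(F.col("pos") / region_size) * region_size).cast("long"))
    .repartition("chrom", "region_start")
    .sortWithinPartitions("chrom", "pos")
)

partitioned.show()
```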
Formatter#
Formatters are responsible for formatting input datasets by converting the schema, adding or deleting columns, or encoding domain-specific objects.
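As an illustration only (not a SeqsLab operator), the following PySpark sketch shows the kinds of changes a formatter makes: converting the schema by casting a column, adding a derived column, and dropping a column the task command does not need. The column names and values are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("formatter-sketch").getOrCreate()

# Hypothetical raw input whose schema does not yet match the task command.
raw = spark.createDataFrame(
    [("chr1", "10143", "A", "G", "PASS")],
    ["chrom", "pos", "ref", "alt", "filter"],
)

formatted = (
    raw
    # Convert the schema: pos arrives as a string, the command expects a long.
    .withColumn("pos", F.col("pos").cast("long"))
    # Add a derived column encoding a domain-specific identifier.
    .withColumn(
        "variant_id",
        F.concat_ws("-", "chrom", F.col("pos").cast("string"), "ref", "alt"),
    )
    # Delete a column the command does not use.
    .drop("filter")
)

formatted.printSchema()
```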
Executor#
Executors are responsible for preprocessing or localizing an input DataFrame as a managed table for a Spark SQL command, or for saving data to local files for shell script execution. An executor is required for each input DataFrame of a workflow task before a pipeline can execute task commands.
Operator | Description
---|---
FastqExecutor | Executor for FASTQ input datasets with `fq.gz`, `fq.bgz`, `fastq.gz`, or `fastq.bgz` file extensions
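The following PySpark sketch illustrates both executor roles under simplified assumptions; it is not the FastqExecutor implementation. It exposes an input DataFrame as a SQL-addressable table (here a temporary view, whereas SeqsLab uses managed tables) and writes the input to a local file for a shell command, with `wc` standing in for a real bioinformatics tool.

```python
import subprocess

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executor-sketch").getOrCreate()

# Hypothetical sequencing reads used as a task input.
reads = spark.createDataFrame(
    [("read1", "ACGT"), ("read2", "GGCA")],
    ["name", "seq"],
)

# Spark SQL path: expose the input DataFrame as a table so a SQL task command
# can refer to it by name.
reads.createOrReplaceTempView("task_input")
spark.sql("SELECT count(*) AS n_reads FROM task_input").show()

# Shell script path: materialize the input as a local FASTQ-like file and hand
# it to a command-line tool.
local_path = "/tmp/task_input.fastq"
with open(local_path, "w") as fh:
    for row in reads.toLocalIterator():
        fh.write(f"@{row['name']}\n{row['seq']}\n+\n{'I' * len(row['seq'])}\n")

subprocess.run(["wc", "-l", local_path], check=True)
```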
Delocalization#
Delocalization collects a file or dataset output by a task command and saves it to a destination (such as blob storage).
Collector#
Collectors are responsible for retrieving the outputs of an executed command from the local file system and returning them as a DataFrame. Collectors can also compute aggregates of command outputs. A collector is required for each output file of a workflow task after the command execution has successfully completed.
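A hedged PySpark sketch of the collector idea: read command outputs from the local file system back into a DataFrame and compute an aggregate over them. The output directory, file layout, and column names are assumptions made for this example.

```python
import pathlib

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("collector-sketch").getOrCreate()

# Pretend the task command wrote per-sample metrics as TSV files on the local
# file system.
out_dir = pathlib.Path("/tmp/task_outputs")
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "part-0.tsv").write_text("sample_id\tcoverage\nS1\t31.4\nS2\t28.9\n")

# Retrieve the command outputs back into a DataFrame ...
metrics = spark.read.csv(
    f"file://{out_dir}/*.tsv", sep="\t", header=True, inferSchema=True
)

# ... and optionally compute aggregates over them.
summary = metrics.groupBy("sample_id").agg(F.avg("coverage").alias("mean_coverage"))
summary.show()
```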
Writer#
Writers are responsible for delocalizing or saving the command execution output DataFrame to the specified storage or repository, such as a cloud file system, HTTPS repository, or JDBC database.
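A conceptual PySpark sketch of writer behavior, not SeqsLab's writer API: the output DataFrame is saved to a file-system destination and, alternatively, to a JDBC database. The URIs, table name, and credentials are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("writer-sketch").getOrCreate()

# Hypothetical task output produced by a collector.
results = spark.createDataFrame(
    [("sample-01", 31.4), ("sample-02", 28.9)],
    ["sample_id", "mean_coverage"],
)

# Delocalize the output DataFrame to a file-system destination
# (swap the URI for a cloud path such as abfss:// or s3a:// in practice).
results.write.mode("overwrite").parquet("file:///tmp/coverage_summary.parquet")

# Or save it to a JDBC database; the URL, table, and credentials below are
# placeholders and require a reachable database plus the matching JDBC driver.
results.write.mode("append").jdbc(
    url="jdbc:postgresql://db.example.com:5432/metrics",
    table="coverage_summary",
    properties={"user": "analyst", "password": "change-me", "driver": "org.postgresql.Driver"},
)
```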