Operator pipelines

Atgenomix currently provides the following operator pipelines on SeqsLab.

Table 11 Input operator pipelines

Name

File type

Argument

Description

R100

All

None

Automatic workload pipeline for localizing either a file or directory, with the lifespan of the cluster

P100

All

None

Automatic workload pipeline for localizing one file per partition to each executor of the WDL task, for an FQN typed as an Array of Files.

Bgen100

.bgen

None

File-based BGEN workload pipeline parallelized by its original index within the file.

S100

All

None

Automatic workload pipeline for localizing either a file or directory in a single-node cluster.

Bam1

.bam

None

File-based BAM workload pipeline with all BAM records in a single partition

Ubam1

.bam

None

File-based BAM workload pipeline with unmapped BAM records in a single partition

Bam100

.bam

partBed: a URL defining how the BAM file is partitioned; see Table 13 (SeqsLab provided partBed for HG38). refSeqDict (optional): a URL defining the sequence dictionary of the reference genome; inferred from the BAM file if not given.

File-based BAM workload pipeline with reads partitioned based on given partBed and refSeqDict.

Bam100P

.bam

partBed: a URL defining how the BAM file is partitioned; see Table 13 (SeqsLab provided partBed for HG38). refSeqDict (optional): a URL defining the sequence dictionary of the reference genome; inferred from the BAM file if not given.

File-based BAM workload pipeline with reads partitioned based on the given partBed and refSeqDict, where every read and its mate are guaranteed to be in the same partition.

Vcf100

.vcf, .gvcf, .vcf.gz, .gvcf.gz

partBed: a URL defining how the VCF file is partitioned; see Table 13 (SeqsLab provided partBed for HG38). refSeqDict (optional): a URL defining the sequence dictionary of the reference genome; inferred from the VCF file if not given.

File-based VCF workload pipeline with VCF records partitioned based on the given partBed.

Bed100

bed,bed.gz

partBed: a URL defining how the BED file is partitioned; see Table 13 (SeqsLab provided partBed for HG38). refSeqDict: a URL defining the sequence dictionary of the reference genome; see Table 15 (SeqsLab provided reference sequence dictionaries).

File-based BED workload pipeline with BED records partitioned based on the given partBed.

Bgen100R

.bgen

bsize: an integer defining the number of variants per partition.

File-based non-imputation BGEN workload pipeline parallelized by provided bsize. This workload pipeline’s design aligns with the official parallelization process of Regenie Step1-Stage1. It will create the master file (named “fit_parallel.master”), which can be output using the RegenieMasterFileWriter output operator pipeline, and split the BGEN into partitions based on the bsize value. By default, one partition refers to one genotype block in Regenie. Following Regenie’s split logic, the variants within each partition come from the same chromosome.

Fastq100

.fastq, .fastq.gz, .fq.gz

readsPerChunk: the number of FASTQ records per partition.

File-based FASTQ workload pipeline with readsPerChunk (e.g. 1,048,576) read records in each partition.

Delta

delta

None

SparkSQL-based Delta Lake workload pipeline

VcfToDelta

vcf,gvcf,vcf.gz,gvcf.gz

None

SparkSQL-based VCF workload pipeline loading VCF file into Delta Table using Glow

CsvToDelta

csv,tsv,csv.gz,tsv.gz

None

File-based CSV/TSV workload pipeline for SQL purposes

CsvToDeltaIndex

csv,tsv,csv.gz,tsv.gz

None

File-based CSV/TSV workload pipeline with index for SQL purposes
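
The readsPerChunk and bsize arguments above both cap the number of records per partition. The following is a minimal Python sketch of that arithmetic, not the SeqsLab implementation: the function names `fastq_partition_count` and `bgen_blocks` and the in-memory list of (chromosome, variant ID) pairs are assumptions made for illustration. The block splitter assumes variants are already ordered by chromosome and starts a new partition at every chromosome boundary, mirroring the same-chromosome rule described for Bgen100R.

```python
import math
from itertools import groupby


def fastq_partition_count(total_reads: int, reads_per_chunk: int) -> int:
    """Number of partitions when each holds at most reads_per_chunk records."""
    if reads_per_chunk <= 0:
        raise ValueError("reads_per_chunk must be positive")
    return math.ceil(total_reads / reads_per_chunk)


def bgen_blocks(variants, bsize):
    """Split (chromosome, variant_id) pairs into blocks of at most bsize.

    Hypothetical helper: assumes the input is sorted by chromosome, and
    never lets a block span two chromosomes.
    """
    if bsize <= 0:
        raise ValueError("bsize must be positive")
    blocks = []
    for _chrom, group in groupby(variants, key=lambda v: v[0]):
        chrom_variants = list(group)
        # Cut each chromosome's variants into consecutive slices of bsize.
        for start in range(0, len(chrom_variants), bsize):
            blocks.append(chrom_variants[start:start + bsize])
    return blocks
```

For example, `fastq_partition_count(3_000_000, 1_048_576)` gives 3 partitions, and splitting three variants on chromosome 1 plus one on chromosome 2 with `bsize=2` gives three blocks, because the last chromosome 1 variant is not merged with the chromosome 2 variant.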

Table 12 Output operator pipelines

Name

File type

Argument

Description

RegenieMasterFileWriter

master

None

File-based Regenie master file output pipeline. It will collect each partition’s master file, created by the input operator pipeline BgenRepartitionerByVariant, and output a master file named “fit_parallel.master”.

DeltaWriterByColumn

delta

None

File-based Delta Lake output pipeline with partitionBy settings for SQL purposes

DeltaToCsvWriter

csv,tsv,csv.gz,tsv.gz

None

File-based CSV/TSV output pipeline with partition settings for SQL purposes

DeltaToJsonWriter

json

None

File-based JSON output pipeline with partition settings for SQL purposes

Table 13 SeqsLab provided partBed for HG38

Description

URI

HG38 primary contigs in 1 partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/single_node_workflow

HG38 autosomes parallelized into 22 partitions, one autosome per partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/autosomes

HG38 primary contigs parallelized into 23 partitions, one autosome per partition, and chrX, chrY, and chrM merged into a single partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/chromosomes

HG38 primary contigs parallelized into 23 partitions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/chromosomes+softclip_or_discordant_reads

HG38 primary contigs parallelized into 50 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_50_parts

HG38 primary contigs parallelized into 50 contiguous unmasked regions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_50_parts+softclip_or_discordant_reads

HG38 primary contigs parallelized into 155 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_155_parts

HG38 primary contigs parallelized into 323 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_323_parts_unpadded

HG38 primary contigs parallelized into 323 contiguous unmasked regions, with 1 kbp padding on both sides of each region.

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_323_parts

HG38 primary contigs parallelized into 3101 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_3101_parts_unpadded

HG38 primary contigs parallelized into 3101 contiguous unmasked regions, with 1 kbp padding on both sides of each region.

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_3101_parts

HG38 primary contigs parallelized into 20361 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_20361_parts

Table 14 SeqsLab provided partBed for HG19

Description

URI

HG19 primary contigs in 1 partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/single_node_workflow

HG19 autosomes parallelized into 22 partitions, one autosome per partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/autosomes

HG19 primary contigs parallelized into 23 partitions, one autosome per partition, and chrX, chrY, and chrM merged into a single partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/chromosomes

HG19 primary contigs parallelized into 23 partitions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/chromosomes+softclip_or_discordant_reads

HG19 primary contigs parallelized into 77 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_77_parts

HG19 primary contigs parallelized into 77 contiguous unmasked regions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_77_parts+softclip_or_discordant_reads

HG19 primary contigs parallelized into 155 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_155_parts

HG19 primary contigs parallelized into 323 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_323_parts_unpadded

HG19 primary contigs parallelized into 323 contiguous unmasked regions, with 1 kbp padding on both sides of each region.

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_323_parts

HG19 primary contigs parallelized into 3109 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_3109_parts_unpadded

HG19 primary contigs parallelized into 3109 contiguous unmasked regions, with 1 kbp padding on both sides of each region.

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_3109_parts
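
The partBed URIs in Tables 13 and 14 share one base path and differ only in the genome build ("38" or "19") and the partition scheme name. As a convenience, a small Python helper can assemble them. This is a sketch under the assumption that each URI is the base path followed directly by the name shown in the table; `part_bed_url` is a hypothetical helper, not part of any SeqsLab API, and it does not check that the named scheme exists for the chosen build.

```python
# Base path taken from the URI column of Tables 13 and 14.
PART_BED_BASE = "https://seqslabbundles.blob.core.windows.net/static/system/bed"


def part_bed_url(build: str, name: str) -> str:
    """Build a partBed URI from a genome build ("38" or "19") and a scheme name.

    Hypothetical helper: the name is appended as-is, without URL encoding.
    """
    if build not in {"19", "38"}:
        raise ValueError("build must be '19' or '38'")
    return f"{PART_BED_BASE}/{build}/{name}"


# Example: the 23-partition scheme for HG38.
hg38_chromosomes = part_bed_url("38", "chromosomes")
```

The resulting string is what would be supplied as the partBed argument of Bam100, Bam100P, Vcf100, or Bed100.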

Table 15 SeqsLab provided reference sequence dictionaries

Description

URI

HG38 reference genome

https://seqslabbundles.blob.core.windows.net/static/reference/38/BROAD-PUB-REF/Homo_sapiens_assembly38.dict

HG38 reference genome with primary contigs only

https://seqslabbundles.blob.core.windows.net/static/reference/38/PRIMARY/Homo_sapiens_assembly38.dict

HG19 reference genome

https://seqslabbundles.blob.core.windows.net/static/reference/19/HG/ref.dict

HG19 reference genome with primary contigs only

https://seqslabbundles.blob.core.windows.net/static/reference/19/HG-primary/ref.dict
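
To see what a refSeqDict supplies, it can help to read the contig names and lengths out of a dictionary file. The Python sketch below assumes the .dict files follow the SAM-header layout commonly used for sequence dictionaries (tab-separated `@SQ` lines carrying `SN:` and `LN:` tags); this page does not confirm the file layout, and `parse_sequence_dict` is a hypothetical helper for illustration only.

```python
def parse_sequence_dict(text: str) -> dict[str, int]:
    """Map contig name to length from SAM-header style @SQ lines."""
    contigs = {}
    for line in text.splitlines():
        if not line.startswith("@SQ"):
            continue  # skip @HD and any other header lines
        # Each field after the record type looks like TAG:value.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


# A two-contig excerpt written inline for the example.
example = "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:chrM\tLN:16569\n"
contig_lengths = parse_sequence_dict(example)
```

Since Python dictionaries preserve insertion order, the result also keeps the contig order of the file.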