Operator pipelines

Atgenomix currently provides the following operator pipelines for workload acceleration on SeqsLab.

Table 12 Input operator pipelines

Name

File type

Argument

Description

R100

All

None

Optimized for localizing a file or directory to each cluster node, ensuring persistence throughout the cluster’s lifespan.

P100

All

None

Designed for efficiently distributing the pre-partitioned files to each task executor without additional partitioning steps.

S100

All

None

Designed for localizing input files to a single machine for single-node computation.

Fastq100

.fastq, .fastq.gz, .fq.gz

readsPerChunk: the number of FASTQ records per partition.

Optimized for partitioning large FASTQ datasets and parallel processing reads within each partition.

Bam1

.bam

None

Designed for processing BAM files where all reads are present in a single partition.

Ubam1

.bam

None

Designed for processing BAM files where unmapped reads are present in a single partition.

Bam100

.bam

partBed: a URL defining how the BAM file is partitioned. Find out more about SeqsLab provided partBed for HG38. refSeqDict (optional): a URL defining the sequence dictionary of the reference genome. Inferred from the BAM file if not given.

Designed for efficiently partitioning BAM files based on the regions defined in the BED file while ensuring read mates are in the same partition and parallel processing reads within each region.

Vcf100

.vcf, .gvcf, .vcf.gz, .gvcf.gz

partBed: a URL defining how the VCF file is partitioned. Find out more about SeqsLab provided partBed for HG38. refSeqDict (optional): a URL defining the sequence dictionary of the reference genome. Inferred from the VCF file if not given.

Designed for efficiently partitioning VCF files based on the regions defined in the BED file and parallel processing variants within each region.

Vcf200

.vcf, .gvcf, .vcf.gz, .gvcf.gz

batchSize: the number of VCF files per iteration of the operator pipeline. batchExecution: set to 1 to execute the task command right after each iteration of the operator pipeline; when not set (default), the task command is executed after all batch iterations are completed. partBed: a URL defining how the VCF file is partitioned. Find out more about SeqsLab provided partBed for HG38. refSeqDict (optional): a URL defining the sequence dictionary of the reference genome. Inferred from the VCF file if not given.

An optimized variant of Vcf100 for large-scale partitioning of numerous VCF files and efficient processing of VCF batch workloads, e.g. JointGenotyping.

Bed100

bed,bed.gz

partBed: a URL defining how the BED file is partitioned. Find out more about SeqsLab provided partBed for HG38. refSeqDict: a URL defining the sequence dictionary of the reference genome. Find out more about SeqsLab provided reference sequence dictionaries.

Designed for partitioning BED files based on the regions defined in another BED file. This is mostly useful for splitting genomic intervals that are then processed in parallel.

Bgen100R

.bgen

bsize: an integer defining the number of variants per partition.

Optimized for parallel processing of BGEN files from large genome-wide association studies in Regenie Step1-Stage1, based on the provided bsize (default 1000).

Bgen100

.bgen

None

Designed for loading large BGEN files based on the in-file index and parallel processing variants within each partition.

Delta

delta

None

Designed for DataFrame workloads associated with tasks utilizing SQL commands.

VcfToDelta

vcf,gvcf,vcf.gz,gvcf.gz

None

Designed for transforming VCF files into a DataFrame table for tasks utilizing SQL commands.

CsvToDelta

csv,tsv,csv.gz,tsv.gz

None

Designed for transforming CSV and TSV files into a Delta table for tasks utilizing SQL commands.

CsvToDeltaIndex

csv,tsv,csv.gz,tsv.gz

None

Optimized for transforming CSV and TSV files along with the index files into DataFrame tables for tasks utilizing SQL commands.
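
The size-based arguments in the table above (readsPerChunk for Fastq100, batchSize for Vcf200, bsize for Bgen100R) all describe grouping items into fixed-size units. The following is a minimal Python sketch of that idea only, not SeqsLab's implementation: the helper names `read_fastq` and `chunked` are hypothetical, and the FASTQ records are made-up sample data.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record is four lines: header, sequence, separator, quality.
FastqRecord = Tuple[str, ...]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from an iterable of text lines."""
    it = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError(f"malformed FASTQ record: {record!r}")
        yield record


def chunked(items: Iterable, size: int) -> Iterator[List]:
    """Group items into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Five made-up reads split with a hypothetical readsPerChunk of 2.
fastq_text = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5))
partitions = list(chunked(read_fastq(fastq_text.splitlines()), size=2))
print([len(p) for p in partitions])  # [2, 2, 1]
```

The same `chunked` helper applied to a list of VCF paths mirrors what batchSize describes: each yielded list would correspond to one iteration of the operator pipeline.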

Table 13 Output operator pipelines

Name

File type

Argument

Description

RegenieMasterFileWriter

master

None

Designed for delocalizing Regenie master files generated by the input pipeline Bgen100R and storing them with a standardized name “fit_parallel.master”.

DeltaWriterByColumn

delta

partitionBy: one or more column names, comma-separated, by which the output is partitioned (e.g. runName or runName,sampleId). partitionNum: an integer defining the number of partitions the output is split into.

Designed for saving DataFrames to Delta tables with specific partitionBy settings, optimized for tasks utilizing SQL commands.

DeltaToCsvWriter

csv,tsv,csv.gz,tsv.gz

header: whether to include a header line (e.g. true, false). delimiter: the delimiter to use. partitionNum: an integer defining the number of partitions the output is split into.

Designed for delocalizing DataFrames outputted from tasks utilizing SQL commands and saving to a CSV file.

DeltaToJsonWriter

json

partitionNum: an integer defining the number of partitions the output is split into.

Designed for delocalizing DataFrames outputted from tasks utilizing SQL commands and saving to a JSON file.

VcfToGenotypeDelta

vcf,gvcf,vcf.gz,gvcf.gz

None

Designed for transforming VCF files into a DataFrame table that contains the runName and genotypes for each variant and sample, optimized for tasks utilizing SQL commands.

VcfToSampleListDelta

vcf,gvcf,vcf.gz,gvcf.gz

None

Designed for transforming VCF files into a DataFrame table that contains variant occurrences across samples, optimized for tasks utilizing SQL commands.

VEPvcfToVariantsDelta

vcf,gvcf,vcf.gz,gvcf.gz

None

Designed for transforming “VEP (Variant Effect Predictor)”-annotated VCF files into a DataFrame table, optimized for tasks utilizing SQL commands.
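
To make the partitionBy, header, and delimiter arguments above more concrete, here is a small pure-Python sketch. It only illustrates what the arguments mean; it is not how the SeqsLab writers are implemented, the function names `group_by_columns` and `to_delimited` are hypothetical, and the rows are made-up sample data.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, str]


def group_by_columns(rows: List[Row], partition_by: str) -> Dict[Tuple[str, ...], List[Row]]:
    """Group rows by the comma-separated column names in `partition_by`."""
    columns = [c.strip() for c in partition_by.split(",") if c.strip()]
    groups: Dict[Tuple[str, ...], List[Row]] = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in columns)].append(row)
    return dict(groups)


def to_delimited(rows: List[Row], header: bool, delimiter: str) -> str:
    """Render rows as delimited text, optionally preceded by a header line."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter=delimiter, lineterminator="\n"
    )
    if header:
        writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows standing in for a DataFrame produced by an SQL task.
rows = [
    {"runName": "run1", "sampleId": "S1", "genotype": "0/1"},
    {"runName": "run1", "sampleId": "S2", "genotype": "1/1"},
    {"runName": "run2", "sampleId": "S1", "genotype": "0/0"},
]
groups = group_by_columns(rows, "runName")
print(sorted(groups))  # [('run1',), ('run2',)]
print(to_delimited(groups[("run1",)], header=True, delimiter="\t"))
```

Passing "runName,sampleId" instead of "runName" would produce one group per distinct (runName, sampleId) pair, which is the effect the partitionBy example values in the table suggest.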

Table 14 SeqsLab provided partBed for HG38

Description

URI

HG38 primary contigs in 1 partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/single_node_workflow

HG38 autosomes parallelized into 22 partitions, one autosome per partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/autosomes

HG38 primary contigs parallelized into 23 partitions, one autosome per partition, and chrX, chrY, and chrM merged into a single partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/chromosomes

HG38 primary contigs parallelized into 23 partitions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/chromosomes+softclip_or_discordant_reads

HG38 primary contigs parallelized into 50 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_50_parts

HG38 primary contigs parallelized into 50 contiguous unmasked regions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_50_parts+softclip_or_discordant_reads

HG38 primary contigs parallelized into 155 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_155_parts_unpadded

HG38 primary contigs parallelized into 155 contiguous unmasked regions, with 1 kbp padding on both sides of each region.

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_155_parts

HG38 primary contigs parallelized into 323 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_323_parts_unpadded

HG38 primary contigs parallelized into 323 contiguous unmasked regions, with 1 kbp padding on both sides of each region.

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_323_parts

HG38 primary contigs parallelized into 3101 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_3101_parts_unpadded

HG38 primary contigs parallelized into 3101 contiguous unmasked regions, with 1 kbp padding on both sides of each region.

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_3101_parts

HG38 primary contigs parallelized into 20361 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_20361_parts
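
As a rough illustration of region-based partitioning with a partBed, the sketch below assigns 1-based positions (as used in VCF) to partitions defined by BED intervals (0-based start, exclusive end). The layout of the actual SeqsLab partBed files is not described here, so the fourth column carrying a partition label is purely an assumption for illustration, as are the coordinates and the function names `parse_part_bed` and `partition_of`.

```python
from typing import Dict, List, Optional, Tuple

# (start, end, label): BED coordinates are 0-based with an exclusive end.
Region = Tuple[int, int, str]


def parse_part_bed(text: str) -> Dict[str, List[Region]]:
    """Parse tab-separated BED text with a hypothetical 4th label column."""
    regions: Dict[str, List[Region]] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, label = line.split("\t")[:4]
        regions.setdefault(chrom, []).append((int(start), int(end), label))
    return regions


def partition_of(regions: Dict[str, List[Region]], chrom: str, pos: int) -> Optional[str]:
    """Return the partition label covering a 1-based position, if any."""
    for start, end, label in regions.get(chrom, []):
        # 1-based pos lies in the 0-based interval [start, end) when start < pos <= end.
        if start < pos <= end:
            return label
    return None


# Made-up intervals; chrX and chrY share one label to mimic merged contigs.
bed_text = (
    "chr1\t0\t1000\tpart-0\n"
    "chr1\t1000\t2000\tpart-1\n"
    "chrX\t0\t500\tpart-2\n"
    "chrY\t0\t500\tpart-2\n"
)
regions = parse_part_bed(bed_text)
print(partition_of(regions, "chr1", 1000))  # part-0
print(partition_of(regions, "chr1", 1001))  # part-1
print(partition_of(regions, "chr2", 5))     # None
```

A linear scan is enough for a sketch; with thousands of regions per contig, sorting the intervals and using binary search would be the usual refinement.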

Table 15 SeqsLab provided partBed for HG19

Description

URI

HG19 primary contigs in 1 partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/single_node_workflow

HG19 autosomes parallelized into 22 partitions, one autosome per partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/autosomes

HG19 primary contigs parallelized into 23 partitions, one autosome per partition, and chrX, chrY, and chrM merged into a single partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/chromosomes

HG19 primary contigs parallelized into 23 partitions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/chromosomes+softclip_or_discordant_reads

HG19 primary contigs parallelized into 77 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_77_parts

HG19 primary contigs parallelized into 77 contiguous unmasked regions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_77_parts+softclip_or_discordant_reads

HG19 primary contigs parallelized into 155 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_155_parts

HG19 primary contigs parallelized into 323 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_323_parts_unpadded

HG19 primary contigs parallelized into 323 contiguous unmasked regions, with 1 kbp padding on both sides of each region.

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_323_parts

HG19 primary contigs parallelized into 3109 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_3109_parts_unpadded

HG19 primary contigs parallelized into 3109 contiguous unmasked regions, with 1 kbp padding on both sides of each region.

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_3109_parts

Table 16 SeqsLab provided reference sequence dictionaries

Description

URI

HG38 reference genome

https://seqslabbundles.blob.core.windows.net/static/reference/38/BROAD-PUB-REF/Homo_sapiens_assembly38.dict

HG38 reference genome with primary contigs only

https://seqslabbundles.blob.core.windows.net/static/reference/38/PRIMARY/Homo_sapiens_assembly38.dict

HG19 reference genome

https://seqslabbundles.blob.core.windows.net/static/reference/19/HG/ref.dict

HG19 reference genome with primary contigs only

https://seqslabbundles.blob.core.windows.net/static/reference/19/HG-primary/ref.dict
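
Because partBed and refSeqDict are supplied together for operators such as Bam100, Vcf100, and Bed100, it can be useful to check that both URLs refer to the same genome build before submitting a run. The sketch below infers the build from the path segments visible in the tables above (`bed/38/`, `reference/19/`, and so on). The `arguments` dictionary layout and the helper names are hypothetical; only the URLs are copied from the tables.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Matches the build segment seen in the SeqsLab URLs listed above.
_BUILD_PATTERN = re.compile(r"/static/(?:system/bed|reference)/(\d+)/")


def genome_build(url: str) -> Optional[str]:
    """Return the build number embedded in a partBed or dictionary URL."""
    match = _BUILD_PATTERN.search(urlparse(url).path)
    return match.group(1) if match else None


def check_same_build(part_bed: str, ref_seq_dict: str) -> str:
    """Raise ValueError unless both URLs point at the same genome build."""
    bed_build = genome_build(part_bed)
    dict_build = genome_build(ref_seq_dict)
    if bed_build is None or dict_build is None:
        raise ValueError("could not infer the genome build from the URL path")
    if bed_build != dict_build:
        raise ValueError(f"partBed is build {bed_build}, refSeqDict is build {dict_build}")
    return bed_build


# Hypothetical argument layout; the two URLs come from the tables above.
arguments = {
    "partBed": "https://seqslabbundles.blob.core.windows.net/static/system/bed/38/chromosomes",
    "refSeqDict": "https://seqslabbundles.blob.core.windows.net/static/reference/38/PRIMARY/Homo_sapiens_assembly38.dict",
}
print(check_same_build(arguments["partBed"], arguments["refSeqDict"]))  # 38
```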