Operator pipelines

Atgenomix currently provides the following operator pipelines for workload acceleration on SeqsLab.

Table 11 Output operator pipelines

Name

File type

Argument

Description

RegenieMasterFileWriter

master

None

Designed for delocalizing Regenie master files generated by the input pipeline Bgen100R and storing them under the standardized name “fit_parallel.master”.

DeltaWriterByColumn

delta

partitionBy: one or more comma-separated column names; the output is partitioned by the given columns (e.g. runName or runName,sampleId). partitionNum: an integer; the output file is partitioned based on this value.

Designed for saving DataFrames to Delta tables with specific partitionBy settings, optimized for tasks utilizing SQL commands.

DeltaToCsvWriter

csv,tsv,csv.gz,tsv.gz

header: whether to include a header row (e.g. true or false). delimiter: the delimiter to use. partitionNum: an integer; the output file is partitioned based on this value.

Designed for delocalizing DataFrames output by tasks utilizing SQL commands and saving them to a CSV file.

DeltaToJsonWriter

json

partitionNum: an integer; the output file is partitioned based on this value.

Designed for delocalizing DataFrames output by tasks utilizing SQL commands and saving them to a JSON file.

VcfToGenotypeDelta

vcf,gvcf,vcf.gz,gvcf.gz

None

Designed for transforming VCF files into a DataFrame table that contains the runName and genotypes for each variant and sample, optimized for tasks utilizing SQL commands.

VcfToSampleListDelta

vcf,gvcf,vcf.gz,gvcf.gz

None

Designed for transforming VCF files into a DataFrame table that contains variant occurrences across samples, optimized for tasks utilizing SQL commands.

VEPvcfToVariantsDelta

vcf,gvcf,vcf.gz,gvcf.gz

None

Designed for transforming “VEP (Variant Effect Predictor)”-annotated VCF files into a DataFrame table, optimized for tasks utilizing SQL commands.
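The partitionBy argument of DeltaWriterByColumn accepts a comma-separated list of column names. Spark and Delta Lake typically lay out a partitioned table as nested `column=value` directories, and the plain-Python sketch below imitates that grouping to show what a value such as `runName,sampleId` implies. It is an illustration only, not the SeqsLab implementation: `partition_rows` and the sample rows are hypothetical.

```python
from collections import defaultdict


def partition_rows(rows, partition_by):
    """Group rows by the columns named in a comma-separated partitionBy string.

    Returns a dict mapping a directory-style key (e.g. "runName=run1/sampleId=S1")
    to the list of rows that fall into that partition.
    """
    columns = [c.strip() for c in partition_by.split(",") if c.strip()]
    groups = defaultdict(list)
    for row in rows:
        key = "/".join(f"{c}={row[c]}" for c in columns)
        groups[key].append(row)
    return dict(groups)


# Hypothetical genotype rows, for illustration only.
rows = [
    {"runName": "run1", "sampleId": "S1", "genotype": "0/1"},
    {"runName": "run1", "sampleId": "S2", "genotype": "1/1"},
    {"runName": "run2", "sampleId": "S1", "genotype": "0/0"},
]

by_run = partition_rows(rows, "runName")
by_run_and_sample = partition_rows(rows, "runName,sampleId")
```

Partitioning by runName alone yields one group per run, while runName,sampleId yields one group per run and sample pair. Queries that filter on the partition columns can then skip unrelated groups.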

Table 12 SeqsLab-provided partBed for HG38

Description

URI

HG38 primary contigs in 1 partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/single_node_workflow

HG38 autosomes parallelized into 22 partitions, one autosome per partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/autosomes

HG38 primary contigs parallelized into 23 partitions: one autosome per partition, with chrX, chrY, and chrM merged into a single partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/chromosomes

HG38 primary contigs parallelized into 23 partitions, plus an extra partition containing soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/chromosomes+softclip_or_discordant_reads

HG38 primary contigs parallelized into 50 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_50_parts

HG38 primary contigs parallelized into 50 contiguous unmasked regions, plus an extra partition containing soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_50_parts+softclip_or_discordant_reads

HG38 primary contigs parallelized into 155 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_155_parts

HG38 primary contigs parallelized into 323 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_323_parts_unpadded

HG38 primary contigs parallelized into 323 contiguous unmasked regions, with 1 kbp padding on both sides of each region.

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_323_parts

HG38 primary contigs parallelized into 3101 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_3101_parts_unpadded

HG38 primary contigs parallelized into 3101 contiguous unmasked regions, with 1 kbp padding on both sides of each region.

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_3101_parts

HG38 primary contigs parallelized into 20361 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_20361_parts
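
Several partBed entries differ only in whether each region carries 1 kbp of padding on both sides. The sketch below shows what such padding means for BED intervals, which are 0-based and half-open: each start moves left and each end moves right by the padding, clamped to the contig boundaries. How SeqsLab generated its padded files is not described here, so `pad_regions`, the regions, and the contig length are hypothetical.

```python
def pad_regions(regions, contig_lengths, padding=1000):
    """Extend each BED interval by `padding` bases on both sides.

    regions: list of (contig, start, end) tuples in 0-based, half-open coordinates.
    contig_lengths: dict mapping contig name to its length in bases.
    Padded intervals are clamped so they never leave the contig.
    """
    padded = []
    for contig, start, end in regions:
        new_start = max(0, start - padding)
        new_end = min(contig_lengths[contig], end + padding)
        padded.append((contig, new_start, new_end))
    return padded


# Hypothetical regions on a toy 10 kbp contig, for illustration only.
toy_lengths = {"chr1": 10000}
toy_regions = [("chr1", 500, 2000), ("chr1", 9500, 10000)]
padded_regions = pad_regions(toy_regions, toy_lengths)
```

Padded neighbouring regions overlap, so a downstream step that merges per-partition results may need to account for records that fall inside the overlap.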

Table 13 SeqsLab-provided partBed for HG19

Description

URI

HG19 primary contigs in 1 partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/single_node_workflow

HG19 autosomes parallelized into 22 partitions, one autosome per partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/autosomes

HG19 primary contigs parallelized into 23 partitions: one autosome per partition, with chrX, chrY, and chrM merged into a single partition

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/chromosomes

HG19 primary contigs parallelized into 23 partitions, plus an extra partition containing soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/chromosomes+softclip_or_discordant_reads

HG19 primary contigs parallelized into 77 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_77_parts

HG19 primary contigs parallelized into 77 contiguous unmasked regions, plus an extra partition containing soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_77_parts+softclip_or_discordant_reads

HG19 primary contigs parallelized into 155 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_155_parts

HG19 primary contigs parallelized into 323 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_323_parts_unpadded

HG19 primary contigs parallelized into 323 contiguous unmasked regions, with 1 kbp padding on both sides of each region.

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_323_parts

HG19 primary contigs parallelized into 3109 contiguous unmasked regions

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_3109_parts_unpadded

HG19 primary contigs parallelized into 3109 contiguous unmasked regions, with 1 kbp padding on both sides of each region.

https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_3109_parts
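
The partBed URIs listed for HG38 and HG19 share one layout: a common base, the genome build (38 or 19), and a scheme name such as autosomes or chromosomes. Assuming that pattern, a small helper can assemble a URI from the two variable parts. `BASE_URI` and `part_bed_uri` are hypothetical names introduced for this sketch, not part of any SeqsLab API, and only the scheme names that appear in the tables are known to exist for a given build.

```python
BASE_URI = "https://seqslabbundles.blob.core.windows.net/static/system/bed"

# Genome builds that appear in the partBed tables.
KNOWN_BUILDS = ("38", "19")


def part_bed_uri(build, scheme):
    """Assemble a partBed URI from a genome build and a scheme name.

    build: "38" for HG38 or "19" for HG19.
    scheme: the final path segment taken from the partBed tables,
            e.g. "autosomes" or "contiguous_unmasked_regions_155_parts".
    The helper does not check that the scheme exists for the given build.
    """
    if build not in KNOWN_BUILDS:
        raise ValueError(f"unknown genome build: {build!r}")
    return f"{BASE_URI}/{build}/{scheme}"


hg38_autosomes = part_bed_uri("38", "autosomes")
hg19_chromosomes = part_bed_uri("19", "chromosomes")
```

Because partition counts differ between builds (for example, 50 parts exist only for HG38 and 77 parts only for HG19), it is safer to copy the scheme name from the matching table than to derive it.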

Table 14 SeqsLab-provided reference sequence dictionaries

Description

URI

HG38 reference genome

https://seqslabbundles.blob.core.windows.net/static/reference/38/BROAD-PUB-REF/Homo_sapiens_assembly38.dict

HG38 reference genome with primary contigs only

https://seqslabbundles.blob.core.windows.net/static/reference/38/PRIMARY/Homo_sapiens_assembly38.dict

HG19 reference genome

https://seqslabbundles.blob.core.windows.net/static/reference/19/HG/ref.dict

HG19 reference genome with primary contigs only

https://seqslabbundles.blob.core.windows.net/static/reference/19/HG-primary/ref.dict
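
A sequence dictionary (.dict) file is a SAM-style text header in which each `@SQ` line carries a contig name (`SN`) and length (`LN`) as tab-separated tags. The sketch below reads those two tags into a mapping, which is handy for checking that a partBed file and a reference agree on contig names. `parse_sequence_dict` is a hypothetical helper, and the sample text is a shortened illustration rather than the content of the files listed above.

```python
def parse_sequence_dict(text):
    """Return {contig name: length} from the @SQ lines of a sequence dictionary."""
    lengths = {}
    for line in text.splitlines():
        if not line.startswith("@SQ"):
            continue  # skip @HD and any other header record types
        # Each field after the record type has the form TAG:value;
        # split on the first colon only, since values may contain colons.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        lengths[fields["SN"]] = int(fields["LN"])
    return lengths


# Shortened, illustrative dictionary content.
sample_dict = (
    "@HD\tVN:1.6\n"
    "@SQ\tSN:chr1\tLN:248956422\tUR:file:/ref.fa\n"
    "@SQ\tSN:chrM\tLN:16569\n"
)
contig_lengths = parse_sequence_dict(sample_dict)
```

Dictionaries limited to primary contigs produce a shorter mapping, which makes it easy to spot alternate or decoy contigs that a given partBed scheme does not cover.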