Operator pipelines


Atgenomix currently provides the following operator pipelines for workload acceleration on SeqsLab.

Table 12 Input operator pipelines
Name	File type	Argument	Description
`R100`	All	None	Optimized for localizing a file or directory to each cluster node, ensuring persistence throughout the cluster’s lifespan.
`P100`	All	None	Designed for efficiently distributing the pre-partitioned files to each task executor without additional partitioning steps.
`S100`	All	None	Designed for localizing input files to a single machine for single-node computation.
`Fastq100`	`.fastq`, `.fastq.gz`, `.fq.gz`	`readsPerChunk`: the number of FASTQ records per partition.	Optimized for partitioning large FASTQ datasets and processing the reads within each partition in parallel.
`Bam1`	`.bam`	None	Designed for processing BAM files where all reads are present in a single partition.
`Ubam1`	`.bam`	None	Designed for processing BAM files where unmapped reads are present in a single partition.
`Bam100`	`.bam`	`partBed`: a URL defining how the BAM file is partitioned; see Table 14 for the SeqsLab-provided partBed files for HG38. `refSeqDict` (optional): a URL defining the sequence dictionary of the reference genome; inferred from the BAM file if not given.	Designed for efficiently partitioning BAM files based on the regions defined in the BED file, keeping read mates in the same partition, and processing the reads within each region in parallel.
`Vcf100`	`.vcf,gvcf,vcf.gz,gvcf.gz`	`partBed`: a URL defining how the VCF file is partitioned; see Table 14 for the SeqsLab-provided partBed files for HG38. `refSeqDict` (optional): a URL defining the sequence dictionary of the reference genome; inferred from the VCF file if not given.	Designed for efficiently partitioning VCF files based on the regions defined in the BED file and processing the variants within each region in parallel.
`Vcf200`	`.vcf,gvcf,vcf.gz,gvcf.gz`	`batchSize`: the number of VCF files per iteration of the operator pipeline. `batchExecution`: set to 1 to execute the task command right after each iteration of the operator pipeline; when not set (default), the task command is executed after all batch iterations are completed. `partBed`: a URL defining how the VCF file is partitioned; see Table 14 for the SeqsLab-provided partBed files for HG38. `refSeqDict` (optional): a URL defining the sequence dictionary of the reference genome; inferred from the VCF file if not given.	A variant of `Vcf100` optimized for large-scale partitioning of numerous VCF files and efficient processing of VCF batch workloads, e.g. JointGenotyping.
`Bed100`	`bed,bed.gz`	`partBed`: a URL defining how the BED file is partitioned; see Table 14 for the SeqsLab-provided partBed files for HG38. `refSeqDict`: a URL defining the sequence dictionary of the reference genome; see Table 16 for the SeqsLab-provided reference sequence dictionaries.	Designed for partitioning BED files based on the regions defined in another BED file. This is mostly useful for splitting the genomic intervals of an analysis so that they can be processed in parallel.
`Bgen100R`	`.bgen`	`bsize`: an integer defining the number of variants per partition (default 1000).	Optimized for parallel processing of BGEN files from large genome-wide association studies in Regenie Step1-Stage1, partitioned according to the provided `bsize`.
`Bgen100`	`.bgen`	None	Designed for loading large BGEN files based on the in-file index and parallel processing variants within each partition.
`Delta`	`delta`	None	Designed for DataFrame workloads associated with tasks utilizing SQL commands.
`VcfToDelta`	`vcf,gvcf,vcf.gz,gvcf.gz`	None	Designed for transforming VCF files into a DataFrame table for tasks utilizing SQL commands.
`CsvToDelta`	`csv,tsv,csv.gz,tsv.gz`	None	Designed for transforming CSV and TSV files into a Delta table for tasks utilizing SQL commands.
`CsvToDeltaIndex`	`csv,tsv,csv.gz,tsv.gz`	None	Optimized for transforming CSV and TSV files along with the index files into DataFrame tables for tasks utilizing SQL commands.
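
To make the `readsPerChunk` argument of `Fastq100` concrete, the following is a minimal Python sketch of splitting FASTQ records into fixed-size partitions. It only illustrates the idea and is not SeqsLab's implementation; the function names `read_fastq` and `chunk_records` and the toy reads are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator ("+") and quality lines.
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from an iterable of text lines."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(stripped, 4))
        if not record:
            return  # input exhausted
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def chunk_records(
    records: Iterable[FastqRecord], reads_per_chunk: int
) -> Iterator[List[FastqRecord]]:
    """Group records into partitions of at most reads_per_chunk records."""
    if reads_per_chunk <= 0:
        raise ValueError("reads_per_chunk must be positive")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, reads_per_chunk))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Five toy reads split with a hypothetical readsPerChunk of 2.
    toy: List[str] = []
    for i in range(5):
        toy += [f"@read{i}", "ACGT", "+", "IIII"]
    for index, chunk in enumerate(chunk_records(read_fastq(toy), 2)):
        print(index, [record[0] for record in chunk])
```

With five reads and a chunk size of 2, the sketch produces three partitions holding 2, 2, and 1 reads; only the last partition may be smaller than the requested size.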

Table 13 Output operator pipelines
Name	File type	Argument	Description
`RegenieMasterFileWriter`	`master`	None	Designed for delocalizing Regenie master files generated by the input pipeline `Bgen100R` and storing them with a standardized name “fit_parallel.master”.
`DeltaWriterByColumn`	`delta`	`partitionBy`: one or more comma-separated column names by which the output is partitioned (e.g. `runName` or `runName,sampleId`). `partitionNum`: an integer value; the output file is partitioned based on this value.	Designed for saving DataFrames to Delta tables with specific `partitionBy` settings, optimized for tasks utilizing SQL commands.
`DeltaToCsvWriter`	`csv,tsv,csv.gz,tsv.gz`	`header`: whether to include a header (e.g. `true`, `false`). `delimiter`: the delimiter to use. `partitionNum`: an integer value; the output file is partitioned based on this value.	Designed for delocalizing DataFrames output by tasks utilizing SQL commands and saving them to a CSV file.
`DeltaToJsonWriter`	`json`	`partitionNum`: an integer value; the output file is partitioned based on this value.	Designed for delocalizing DataFrames output by tasks utilizing SQL commands and saving them to a JSON file.
`VcfToGenotypeDelta`	`vcf,gvcf,vcf.gz,gvcf.gz`	None	Optimized for transforming VCF files into a DataFrame table that contains the runName and genotypes for each variant and sample, for use by tasks utilizing SQL commands.
`VcfToSampleListDelta`	`vcf,gvcf,vcf.gz,gvcf.gz`	None	Optimized for transforming VCF files into a DataFrame table that contains variant occurrences across samples, for use by tasks utilizing SQL commands.
`VEPvcfToVariantsDelta`	`vcf,gvcf,vcf.gz,gvcf.gz`	None	Designed for transforming VCF files annotated by VEP (Variant Effect Predictor) into a DataFrame table, optimized for tasks utilizing SQL commands.
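
As an illustration of what a `partitionBy` value such as `runName` or `runName,sampleId` means for `DeltaWriterByColumn`, here is a small pure-Python sketch that groups rows by the listed columns. It assumes the Hive-style `column=value` directory layout that Delta Lake uses for partitioned tables; the function name `partition_rows`, the `gt` column, and the sample rows are hypothetical, and the sketch does not call SeqsLab or Spark.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, str]


def partition_rows(rows: List[Row], partition_by: str) -> Dict[str, List[Row]]:
    """Group rows by the comma-separated columns named in partition_by.

    Keys are Hive-style relative paths such as "runName=run1/sampleId=A"
    (an assumed layout, used here only to label the groups).
    """
    columns = [name.strip() for name in partition_by.split(",") if name.strip()]
    if not columns:
        raise ValueError("partition_by must name at least one column")
    groups: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        path = "/".join(f"{name}={row[name]}" for name in columns)
        groups[path].append(row)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical genotype rows.
    rows = [
        {"runName": "run1", "sampleId": "A", "gt": "0/1"},
        {"runName": "run1", "sampleId": "B", "gt": "1/1"},
        {"runName": "run2", "sampleId": "A", "gt": "0/0"},
    ]
    for path, members in partition_rows(rows, "runName,sampleId").items():
        print(path, len(members))
```

Partitioning by `runName` alone yields one group per run, while `runName,sampleId` yields one group per run and sample combination, so queries that filter on those columns only need to read the matching groups.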

Table 14 SeqsLab-provided partBed for HG38
Description	URI
HG38 primary contigs in 1 partition	https://seqslabbundles.blob.core.windows.net/static/system/bed/38/single_node_workflow
HG38 autosomes parallelized into 22 partitions, one autosome per partition	https://seqslabbundles.blob.core.windows.net/static/system/bed/38/autosomes
HG38 primary contigs parallelized into 23 partitions, one autosome per partition, and chrX, chrY, and chrM merged into a single partition	https://seqslabbundles.blob.core.windows.net/static/system/bed/38/chromosomes
HG38 primary contigs parallelized into 23 partitions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.	https://seqslabbundles.blob.core.windows.net/static/system/bed/38/chromosomes+softclip_or_discordant_reads
HG38 primary contigs parallelized into 50 contiguous unmasked regions	https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_50_parts
HG38 primary contigs parallelized into 50 contiguous unmasked regions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.	https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_50_parts+softclip_or_discordant_reads
HG38 primary contigs parallelized into 155 contiguous unmasked regions	https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_155_parts_unpadded
HG38 primary contigs parallelized into 155 contiguous unmasked regions, with 1 kbp of padding on both sides of each region.	https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_155_parts
HG38 primary contigs parallelized into 323 contiguous unmasked regions	https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_323_parts_unpadded
HG38 primary contigs parallelized into 323 contiguous unmasked regions, with 1 kbp of padding on both sides of each region.	https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_323_parts
HG38 primary contigs parallelized into 3101 contiguous unmasked regions	https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_3101_parts_unpadded
HG38 primary contigs parallelized into 3101 contiguous unmasked regions, with 1 kbp of padding on both sides of each region.	https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_3101_parts
HG38 primary contigs parallelized into 20361 contiguous unmasked regions	https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_20361_parts
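
The region-based partitioning that a partBed drives can be sketched as follows. The example assumes a standard BED layout (chromosome, 0-based start, exclusive end) and uses made-up regions; the actual content of the SeqsLab partBed resources is not shown in this document, and `parse_bed` and `partition_index` are hypothetical helper names.

```python
from typing import List, NamedTuple, Optional


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)


def parse_bed(text: str) -> List[Region]:
    """Parse the first three columns of each BED line into a Region."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and BED header lines
        chrom, start, end = line.split()[:3]
        regions.append(Region(chrom, int(start), int(end)))
    return regions


def partition_index(
    regions: List[Region], chrom: str, pos_1based: int
) -> Optional[int]:
    """Return the index of the region containing a 1-based position.

    VCF and SAM positions are 1-based, whereas BED intervals are 0-based
    and half-open, so the position is shifted by one before comparing.
    """
    pos0 = pos_1based - 1
    for index, region in enumerate(regions):
        if region.chrom == chrom and region.start <= pos0 < region.end:
            return index
    return None  # position not covered by any region


if __name__ == "__main__":
    # Made-up regions, not taken from any SeqsLab partBed.
    bed_text = "chr1\t0\t1000\nchr1\t1000\t2000\nchr2\t0\t500\n"
    regions = parse_bed(bed_text)
    for chrom, pos in [("chr1", 1000), ("chr1", 1001), ("chr2", 501)]:
        print(chrom, pos, partition_index(regions, chrom, pos))
```

The linear scan keeps the sketch short; with thousands of regions, sorting the regions per chromosome and using binary search would be the more practical choice.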

Table 15 SeqsLab-provided partBed for HG19
Description	URI
HG19 primary contigs in 1 partition	https://seqslabbundles.blob.core.windows.net/static/system/bed/19/single_node_workflow
HG19 autosomes parallelized into 22 partitions, one autosome per partition	https://seqslabbundles.blob.core.windows.net/static/system/bed/19/autosomes
HG19 primary contigs parallelized into 23 partitions, one autosome per partition, and chrX, chrY, and chrM merged into a single partition	https://seqslabbundles.blob.core.windows.net/static/system/bed/19/chromosomes
HG19 primary contigs parallelized into 23 partitions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.	https://seqslabbundles.blob.core.windows.net/static/system/bed/19/chromosomes+softclip_or_discordant_reads
HG19 primary contigs parallelized into 77 contiguous unmasked regions	https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_77_parts
HG19 primary contigs parallelized into 77 contiguous unmasked regions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis.	https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_77_parts+softclip_or_discordant_reads
HG19 primary contigs parallelized into 155 contiguous unmasked regions	https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_155_parts
HG19 primary contigs parallelized into 323 contiguous unmasked regions	https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_323_parts_unpadded
HG19 primary contigs parallelized into 323 contiguous unmasked regions, with 1 kbp of padding on both sides of each region.	https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_323_parts
HG19 primary contigs parallelized into 3109 contiguous unmasked regions	https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_3109_parts_unpadded
HG19 primary contigs parallelized into 3109 contiguous unmasked regions, with 1 kbp of padding on both sides of each region.	https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_3109_parts
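
The difference between the padded and unpadded partBed variants can be illustrated with a short sketch that widens each region by 1 kbp on both sides. Clamping the padded interval to the contig boundaries is an assumption made here for safety, not a documented SeqsLab behavior, and `pad_regions`, the regions, and the contig length are hypothetical.

```python
from typing import Dict, List, Tuple

# (chromosome, 0-based start, exclusive end), as in BED.
Region = Tuple[str, int, int]


def pad_regions(
    regions: List[Region], contig_lengths: Dict[str, int], padding: int = 1000
) -> List[Region]:
    """Extend each region by `padding` bases on both sides.

    The result is clamped to [0, contig length] so that padded regions
    never run past the ends of a contig (an assumption of this sketch).
    """
    padded = []
    for chrom, start, end in regions:
        new_start = max(0, start - padding)
        new_end = min(contig_lengths[chrom], end + padding)
        padded.append((chrom, new_start, new_end))
    return padded


if __name__ == "__main__":
    # Made-up regions on a made-up 9,500 bp contig.
    regions = [("chr1", 0, 5000), ("chr1", 5000, 9000)]
    print(pad_regions(regions, {"chr1": 9500}))
```

Padded neighbours overlap by twice the padding, which gives tools that need flanking context (for example local reassembly) the bases around each region boundary, at the cost of processing the overlapping bases twice.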

Table 16 SeqsLab-provided reference sequence dictionaries
Description	URI
HG38 reference genome	https://seqslabbundles.blob.core.windows.net/static/reference/38/BROAD-PUB-REF/Homo_sapiens_assembly38.dict
HG38 reference genome with primary contigs only	https://seqslabbundles.blob.core.windows.net/static/reference/38/PRIMARY/Homo_sapiens_assembly38.dict
HG19 reference genome	https://seqslabbundles.blob.core.windows.net/static/reference/19/HG/ref.dict
HG19 reference genome with primary contigs only	https://seqslabbundles.blob.core.windows.net/static/reference/19/HG-primary/ref.dict
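
A sequence dictionary (`.dict`) is a text file in SAM header format, in which each `@SQ` line carries a contig name (`SN`) and length (`LN`). The sketch below shows how such a file could be read into a name-to-length mapping, for example to check that a partBed and a `refSeqDict` refer to the same contigs. The helper name `parse_sequence_dict` and the two-contig sample text are hypothetical; the sample is not a copy of any file listed above.

```python
from typing import Dict


def parse_sequence_dict(text: str) -> Dict[str, int]:
    """Map contig names to lengths using the @SQ lines of a .dict file."""
    contigs: Dict[str, int] = {}
    for line in text.splitlines():
        if not line.startswith("@SQ"):
            continue  # skip @HD and any other header lines
        # Each field after the record type is TAG:VALUE; values may
        # themselves contain colons (e.g. UR:file:/...), so split once.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        contigs[tags["SN"]] = int(tags["LN"])
    return contigs


if __name__ == "__main__":
    # Minimal hand-written example in SAM header format.
    sample = (
        "@HD\tVN:1.6\tSO:unsorted\n"
        "@SQ\tSN:chr1\tLN:248956422\tUR:file:/ref/example.fasta\n"
        "@SQ\tSN:chrM\tLN:16569\n"
    )
    for name, length in parse_sequence_dict(sample).items():
        print(name, length)
```

Because Python dictionaries preserve insertion order, the returned mapping also keeps the contig order of the dictionary file, which is the order many genomics tools expect for sorting.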