Operator pipelines
Atgenomix currently provides the following operator pipelines for workload acceleration on SeqsLab.
| Name | File type | Argument | Description |
|---|---|---|---|
| | All | None | Optimized for localizing a file or directory to each cluster node, ensuring persistence throughout the cluster’s lifespan. |
| | All | None | Optimized for localizing a file or directory to each cluster node, ensuring persistence throughout the task’s lifespan. |
| | All | None | Designed for efficiently distributing pre-partitioned files to each task executor without additional partitioning steps. |
| | All | None | Designed for localizing input files to a single machine for single-node computation. |
| | | | Optimized for partitioning large FASTQ datasets and processing reads in parallel within each partition. |
| | | None | Designed for processing BAM files where all reads are present in a single partition. |
| | | None | Designed for processing BAM files where unmapped reads are present in a single partition. |
| | | | Designed for efficiently partitioning BAM files based on the regions defined in the BED file, ensuring read mates are in the same partition, and processing reads in parallel within each region. |
| | | | Designed for efficiently partitioning VCF files based on the regions defined in the BED file and processing variants in parallel within each region. |
| | | | Optimized Vcf100 for large-scale partitioning of numerous VCF files and efficient processing of VCF batch workloads. |
| | | | Designed for efficiently partitioning VCF files based on the GRCh38 autosomal regions and phasing variants in parallel within each region. Users can retrieve each region’s contigName using ~{jsonpath("$.dataset.partition().contigName")}. |
| | | | Designed for efficiently partitioning VCF files based on the regions defined in the imp5Chunker output file and imputing variants in parallel within each region. Users can retrieve each region’s value corresponding to the imp5Chunker output file using the jsonpath function. |
| | | | Designed for partitioning BED files based on the regions defined in another BED file. This is mainly useful for defining the genomic intervals to process in parallel. |
| | | | Optimized for parallel processing of BGEN files from large genome-wide association studies in Regenie Step1-Stage1, based on the provided bsize (default 1000). |
| | | None | Designed for loading large BGEN files based on the in-file index and processing variants in parallel within each partition. |
| | | None | Designed for DataFrame workloads associated with tasks utilizing SQL commands. |
| | All | None | Designed for DataFrame workloads associated with tasks utilizing Scala commands. |
| | | | Designed for repartitioning a DeltaTable by rows according to the given rowsPerPartition argument. |
| | | None | Designed for transforming VCF files into a DataFrame table for tasks utilizing SQL commands. |
| | | None | Designed for transforming CSV and TSV files into a Delta table for tasks utilizing SQL commands. |
| | | None | Optimized for transforming CSV and TSV files along with the index files into DataFrame tables for tasks utilizing SQL commands. |

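
As a rough illustration of the region-based partitioning described for the BED-driven pipelines above, the following Python sketch groups record positions by the BED interval that contains them. It is not the SeqsLab implementation. The function names `parse_bed` and `assign_to_regions` and the sample data are hypothetical, and the sketch assumes plain three-column BED input (0-based, half-open) and 1-based record positions, as in VCF.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (contig, start, end) in BED convention: 0-based, half-open.
Region = Tuple[str, int, int]
# (contig, 1-based position), as in a VCF record.
Record = Tuple[str, int]


def parse_bed(text: str) -> List[Region]:
    """Parse minimal three-column BED text into (contig, start, end) tuples."""
    regions: List[Region] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split()[:3]
        regions.append((contig, int(start), int(end)))
    return regions


def assign_to_regions(records: List[Record], regions: List[Region]) -> Dict[Region, List[Record]]:
    """Group records by the first BED region that contains them."""
    grouped: Dict[Region, List[Record]] = defaultdict(list)
    for contig, pos in records:
        for region in regions:
            r_contig, start, end = region
            # A 1-based position p lies inside a 0-based half-open
            # interval [start, end) when start < p <= end.
            if contig == r_contig and start < pos <= end:
                grouped[region].append((contig, pos))
                break
    return dict(grouped)


# Hypothetical sample data, for illustration only.
bed_text = "chr1\t0\t1000\nchr1\t1000\t2000\nchr2\t0\t500\n"
variants = [("chr1", 10), ("chr1", 1000), ("chr1", 1001), ("chr2", 42)]
partitions = assign_to_regions(variants, parse_bed(bed_text))
```

Each key of `partitions` corresponds to one BED region, so the records under a key could be handed to a separate worker. The linear scan over regions keeps the sketch short; a sorted index or interval tree would be the usual choice for large BED files.
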
| Name | File type | Argument | Description |
|---|---|---|---|
| | | None | Designed for delocalizing Regenie master files generated by the input pipeline. |
| | | | Designed for saving DataFrames to Delta tables with specific |
| | | | Designed for delocalizing DataFrames produced by tasks utilizing SQL commands and saving them to a CSV file. |
| | | | Designed for delocalizing DataFrames produced by tasks utilizing SQL commands and saving them to a JSON file. |
| | | | Designed for delocalizing DataFrames produced by tasks utilizing SQL commands and saving them to a VCF file with Glow. |
| | | None | Optimized for transforming VCF files into a DataFrame table that contains the runName and genotypes for each variant and sample, for use by tasks utilizing SQL commands. |
| | | None | Optimized for transforming VCF files into a DataFrame table that contains variant occurrences across samples, for use by tasks utilizing SQL commands. |
| | | None | Designed for transforming VCF files annotated with VEP (Variant Effect Predictor) into a DataFrame table, for use by tasks utilizing SQL commands. |
| | | | Optimized for partitioning large FASTA datasets and processing nucleotide sequences in parallel within each partition. |
| | | | Designed for efficiently partitioning SAM files generated in the HUMAnN3 PangenomeSearch pipeline, based on the regions defined in the BED file, and processing reads in parallel within each region. |
| | | | Designed for efficiently partitioning TSV files generated in the HUMAnN3 TranslatedSearch pipeline, based on the regions defined in the BED file, and processing records in parallel within each region. |

| Description | URI |
|---|---|
| HG38 primary contigs in 1 partition. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/single_node_workflow |
| HG38 autosomes parallelized into 22 partitions, one autosome per partition. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/autosomes |
| HG38 primary contigs parallelized into 23 partitions: one autosome per partition, with chrX, chrY, and chrM merged into a single partition. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/chromosomes |
| HG38 primary contigs parallelized into 23 partitions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/chromosomes+softclip_or_discordant_reads |
| HG38 primary contigs parallelized into 50 contiguous unmasked regions. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_50_parts |
| HG38 primary contigs parallelized into 50 contiguous unmasked regions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_50_parts+softclip_or_discordant_reads |
| HG38 primary contigs parallelized into 155 contiguous unmasked regions. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_155_parts_unpadded |
| HG38 primary contigs parallelized into 155 contiguous unmasked regions, with 1 kbp padding on both sides of each region. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_155_parts |
| HG38 primary contigs parallelized into 323 contiguous unmasked regions. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_323_parts_unpadded |
| HG38 primary contigs parallelized into 323 contiguous unmasked regions, with 1 kbp padding on both sides of each region. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_323_parts |
| HG38 primary contigs parallelized into 3101 contiguous unmasked regions. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_3101_parts_unpadded |
| HG38 primary contigs parallelized into 3101 contiguous unmasked regions, with 1 kbp padding on both sides of each region. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_3101_parts |
| HG38 primary contigs parallelized into 20361 contiguous unmasked regions. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_20361_parts |

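
To make the 22-partition and 23-partition layouts in the table above concrete, here is a minimal Python sketch that builds both groupings from contig names. It only mirrors the descriptions in the table. The helper names and the `chr`-prefixed contig naming are assumptions for illustration, and the actual BED files behind the URIs may be organized differently.

```python
from typing import List

# Assumed contig naming (chr1..chr22, chrX, chrY, chrM), for illustration only.
AUTOSOMES: List[str] = [f"chr{i}" for i in range(1, 23)]
MERGED_CONTIGS: List[str] = ["chrX", "chrY", "chrM"]


def autosome_partitions() -> List[List[str]]:
    """One partition per autosome: 22 partitions in total."""
    return [[contig] for contig in AUTOSOMES]


def chromosome_partitions() -> List[List[str]]:
    """One partition per autosome, plus chrX, chrY, and chrM merged: 23 in total."""
    partitions = autosome_partitions()
    partitions.append(list(MERGED_CONTIGS))
    return partitions


autosome_layout = autosome_partitions()
chromosome_layout = chromosome_partitions()
```

Merging chrX, chrY, and chrM into one partition keeps the smaller contigs from each occupying a mostly idle executor, which is one plausible reason for the 23-partition layout.
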
| Description | URI |
|---|---|
| HG19 primary contigs in 1 partition. | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/single_node_workflow |
| HG19 autosomes parallelized into 22 partitions, one autosome per partition. | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/autosomes |
| HG19 primary contigs parallelized into 23 partitions: one autosome per partition, with chrX, chrY, and chrM merged into a single partition. | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/chromosomes |
| HG19 primary contigs parallelized into 23 partitions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis. | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/chromosomes+softclip_or_discordant_reads |
| HG19 primary contigs parallelized into 77 contiguous unmasked regions. | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_77_parts |
| HG19 primary contigs parallelized into 77 contiguous unmasked regions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis. | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_77_parts+softclip_or_discordant_reads |
| HG19 primary contigs parallelized into 155 contiguous unmasked regions. | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_155_parts |
| HG19 primary contigs parallelized into 323 contiguous unmasked regions. | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_323_parts_unpadded |
| HG19 primary contigs parallelized into 323 contiguous unmasked regions, with 1 kbp padding on both sides of each region. | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_323_parts |
| HG19 primary contigs parallelized into 3109 contiguous unmasked regions. | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_3109_parts_unpadded |
| HG19 primary contigs parallelized into 3109 contiguous unmasked regions, with 1 kbp padding on both sides of each region. | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_3109_parts |

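
The padded and unpadded variants above differ by 1 kbp of padding on both sides of each region. The following Python sketch shows one way such padding could be computed for a single BED interval. The function name `pad_region` is hypothetical, and clamping the result to the contig bounds is an assumption, since the source does not say how region edges near contig ends are handled.

```python
from typing import Tuple


def pad_region(start: int, end: int, contig_length: int, padding: int = 1000) -> Tuple[int, int]:
    """Extend a 0-based half-open interval by `padding` bases on both sides.

    The result is clamped to [0, contig_length]; this clamping is an
    assumption made for the sketch, not documented behavior.
    """
    padded_start = max(0, start - padding)
    padded_end = min(contig_length, end + padding)
    return padded_start, padded_end


# Hypothetical intervals on a 10,000 bp contig.
inner = pad_region(2000, 3000, contig_length=10_000)
near_edges = pad_region(500, 9500, contig_length=10_000)
```

Padding makes adjacent regions overlap, so a downstream step typically needs to deduplicate or trim results that fall inside the padded margins.
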
| Description | URI |
|---|---|
| HG38 reference genome | https://seqslabbundles.blob.core.windows.net/static/reference/38/BROAD-PUB-REF/Homo_sapiens_assembly38.dict |
| HG38 reference genome with primary contigs only | https://seqslabbundles.blob.core.windows.net/static/reference/38/PRIMARY/Homo_sapiens_assembly38.dict |
| HG19 reference genome | https://seqslabbundles.blob.core.windows.net/static/reference/19/HG/ref.dict |
| HG19 reference genome with primary contigs only | https://seqslabbundles.blob.core.windows.net/static/reference/19/HG-primary/ref.dict |
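
The `.dict` files listed above are sequence dictionaries, which conventionally use SAM header syntax with one `@SQ` line per contig. As a sketch under that assumption, the following Python snippet extracts contig names and lengths from dictionary text. The function name `parse_sequence_dict` and the two-contig sample are illustrative only and were not taken from the files at these URIs.

```python
from typing import Dict


def parse_sequence_dict(text: str) -> Dict[str, int]:
    """Return {contig name: length} from SAM-header-style dictionary text."""
    contigs: Dict[str, int] = {}
    for line in text.splitlines():
        # Only @SQ lines describe reference sequences; skip @HD and others.
        if not line.startswith("@SQ"):
            continue
        # Fields look like TAG:VALUE; split once so values may contain ':'.
        fields = dict(field.split(":", 1) for field in line.split("\t")[1:])
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


# Illustrative sample with two contigs.
sample_dict = (
    "@HD\tVN:1.6\tSO:unsorted\n"
    "@SQ\tSN:chr1\tLN:248956422\tUR:file:/ref/genome.fasta\n"
    "@SQ\tSN:chrM\tLN:16569\n"
)
contig_lengths = parse_sequence_dict(sample_dict)
```

The resulting lengths are what a region-padding or partition-planning step would need in order to keep intervals within contig bounds.
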