Operator pipelines
Atgenomix currently provides the following operator pipelines for workload acceleration on SeqsLab.
| Name | File type | Argument | Description |
|---|---|---|---|
| | All | None | Optimized for localizing a file or directory to each cluster node, ensuring it persists throughout the cluster’s lifespan. |
| | All | None | Designed for efficiently distributing pre-partitioned files to each task executor without additional partitioning steps. |
| | All | None | Designed for localizing input files to a single machine for single-node computation. |
| | | | Optimized for partitioning large FASTQ datasets and processing the reads within each partition in parallel. |
| | | None | Designed for processing BAM files where all reads are present in a single partition. |
| | | None | Designed for processing BAM files where unmapped reads are present in a single partition. |
| | | | Designed for efficiently partitioning BAM files based on the regions defined in the BED file, keeping read mates in the same partition, and processing the reads within each region in parallel. |
| | | | Designed for efficiently partitioning VCF files based on the regions defined in the BED file and processing the variants within each region in parallel. |
| | | | Optimized Vcf100 for large-scale partitioning of numerous VCF files and efficient processing of VCF batch workloads, e.g., JointGenotyping. |
| | | | Designed for partitioning BED files based on the regions defined in another BED file. This is mostly useful for producing the genomic intervals to process in parallel. |
| | | | Optimized for processing the BGEN files of large genome-wide association studies in parallel in Regenie Step1-Stage1, based on the provided bsize (default 1000). |
| | | None | Designed for loading large BGEN files based on the in-file index and processing the variants within each partition in parallel. |
| | | None | Designed for DataFrame workloads associated with tasks utilizing SQL commands. |
| | | None | Designed for transforming VCF files into a DataFrame table for tasks utilizing SQL commands. |
| | | None | Designed for transforming CSV and TSV files into a Delta table for tasks utilizing SQL commands. |
| | | None | Optimized for transforming CSV and TSV files, along with their index files, into DataFrame tables for tasks utilizing SQL commands. |

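
The BED-driven operators above assign each record to the region that contains it. The following is a minimal sketch of that idea in plain Python, not SeqsLab's implementation. The function names `parse_bed` and `partition_by_region`, the toy coordinates, and the choice to drop records outside every region are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def parse_bed(text):
    """Parse BED lines into (chrom, start, end) tuples (0-based, half-open)."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def partition_by_region(regions, variants):
    """Group (chrom, pos) variants by the index of the BED region containing them.

    `pos` is 1-based, as in VCF, so it is converted to 0-based before the
    half-open test start <= pos0 < end. Variants outside every region are
    dropped (an assumption). Regions on a contig are assumed not to overlap.
    """
    by_chrom = defaultdict(list)
    for index, (chrom, start, end) in enumerate(regions):
        by_chrom[chrom].append((start, end, index))

    # Pre-sort each contig's intervals and keep the start coordinates
    # in a separate list so they can be binary-searched.
    lookup = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        lookup[chrom] = ([s for s, _, _ in intervals], intervals)

    partitions = defaultdict(list)
    for chrom, pos in variants:
        if chrom not in lookup:
            continue
        starts, intervals = lookup[chrom]
        pos0 = pos - 1
        i = bisect_right(starts, pos0) - 1  # last interval starting at or before pos0
        if i >= 0 and pos0 < intervals[i][1]:
            partitions[intervals[i][2]].append((chrom, pos))
    return dict(partitions)
```

Each key of the returned dictionary is the position of a region in the BED file, so the number of keys is bounded by the number of BED regions, which is what determines the degree of parallelism.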
| Name | File type | Argument | Description |
|---|---|---|---|
| | | None | Designed for delocalizing Regenie master files generated by the input pipeline. |
| | | | Designed for saving DataFrames to Delta tables with specific |
| | | | Designed for delocalizing DataFrames output by tasks utilizing SQL commands and saving them to a CSV file. |
| | | | Designed for delocalizing DataFrames output by tasks utilizing SQL commands and saving them to a JSON file. |
| | | None | Designed for transforming VCF files into a DataFrame table that contains the runName and genotypes for each variant and sample, optimized for tasks utilizing SQL commands. |
| | | None | Designed for transforming VCF files into a DataFrame table that contains variant occurrences across samples, optimized for tasks utilizing SQL commands. |
| | | None | Designed for transforming VEP (Variant Effect Predictor)-annotated VCF files into a DataFrame table, optimized for tasks utilizing SQL commands. |

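
To make the "runName and genotypes for each variant and sample" layout more concrete, here is a rough sketch that flattens VCF text into one row per variant and sample using only the standard library. It is not the operator's actual schema: apart from `runName`, the column names, the `vcf_to_rows` function, and the assumption that every record carries a `GT` entry in its FORMAT column are hypothetical.

```python
def vcf_to_rows(vcf_text, run_name):
    """Flatten VCF text into one dict per (variant, sample) pair.

    Assumes a tab-delimited VCF with a #CHROM header line, a FORMAT
    column, and a GT key in every record (illustrative assumptions).
    """
    samples = []
    rows = []
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")
        for sample, call in zip(samples, fields[9:]):
            rows.append({
                "runName": run_name,
                "chrom": chrom,
                "pos": int(pos),
                "ref": ref,
                "alt": alt,
                "sample": sample,
                "genotype": call.split(":")[gt_index],
            })
    return rows
```

A long table of this shape can be filtered or aggregated per sample with ordinary SQL, which is the kind of workload these operators are described as targeting.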
| Description | URI |
|---|---|
| HG38 primary contigs in 1 partition | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/single_node_workflow |
| HG38 autosomes parallelized into 22 partitions, one autosome per partition | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/autosomes |
| HG38 primary contigs parallelized into 23 partitions: one autosome per partition, with chrX, chrY, and chrM merged into a single partition | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/chromosomes |
| HG38 primary contigs parallelized into 23 partitions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/chromosomes+softclip_or_discordant_reads |
| HG38 primary contigs parallelized into 50 contiguous unmasked regions | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_50_parts |
| HG38 primary contigs parallelized into 50 contiguous unmasked regions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis. | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_50_parts+softclip_or_discordant_reads |
| HG38 primary contigs parallelized into 155 contiguous unmasked regions | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_155_parts_unpadded |
| HG38 primary contigs parallelized into 155 contiguous unmasked regions, with 1 kbp padding on both sides of each region | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_155_parts |
| HG38 primary contigs parallelized into 323 contiguous unmasked regions | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_323_parts_unpadded |
| HG38 primary contigs parallelized into 323 contiguous unmasked regions, with 1 kbp padding on both sides of each region | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_323_parts |
| HG38 primary contigs parallelized into 3101 contiguous unmasked regions | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_3101_parts_unpadded |
| HG38 primary contigs parallelized into 3101 contiguous unmasked regions, with 1 kbp padding on both sides of each region | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_3101_parts |
| HG38 primary contigs parallelized into 20361 contiguous unmasked regions | https://seqslabbundles.blob.core.windows.net/static/system/bed/38/contiguous_unmasked_regions_20361_parts |

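
Several of the BED files above differ only in whether each region carries 1 kbp of padding on both sides. As an illustration of what such padding means for the coordinates, the sketch below widens BED intervals by a fixed amount. The function name `pad_regions` and the clamping to the contig boundaries are assumptions; how the published files were actually generated is not stated here.

```python
def pad_regions(regions, contig_lengths, padding=1000):
    """Extend each (chrom, start, end) BED interval by `padding` bases per side.

    Coordinates are 0-based, half-open. The padded interval is clamped to
    [0, contig length] so it never runs off the contig (an assumption).
    """
    padded = []
    for chrom, start, end in regions:
        new_start = max(0, start - padding)
        new_end = min(contig_lengths[chrom], end + padding)
        padded.append((chrom, new_start, new_end))
    return padded
```

Padded neighbours overlap by design, so records near a boundary can appear in two partitions and may need de-duplication downstream.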
| Description | URI |
|---|---|
| HG19 primary contigs in 1 partition | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/single_node_workflow |
| HG19 autosomes parallelized into 22 partitions, one autosome per partition | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/autosomes |
| HG19 primary contigs parallelized into 23 partitions: one autosome per partition, with chrX, chrY, and chrM merged into a single partition | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/chromosomes |
| HG19 primary contigs parallelized into 23 partitions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis. | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/chromosomes+softclip_or_discordant_reads |
| HG19 primary contigs parallelized into 77 contiguous unmasked regions | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_77_parts |
| HG19 primary contigs parallelized into 77 contiguous unmasked regions, plus an extra partition with soft-clipped and discordant alignments. Recommended for structural variation discovery analysis. | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_77_parts+softclip_or_discordant_reads |
| HG19 primary contigs parallelized into 155 contiguous unmasked regions | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_155_parts |
| HG19 primary contigs parallelized into 323 contiguous unmasked regions | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_323_parts_unpadded |
| HG19 primary contigs parallelized into 323 contiguous unmasked regions, with 1 kbp padding on both sides of each region | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_323_parts |
| HG19 primary contigs parallelized into 3109 contiguous unmasked regions | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_3109_parts_unpadded |
| HG19 primary contigs parallelized into 3109 contiguous unmasked regions, with 1 kbp padding on both sides of each region | https://seqslabbundles.blob.core.windows.net/static/system/bed/19/contiguous_unmasked_regions_3109_parts |

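
The `chromosomes` layout listed for both genome builds gives each autosome its own partition and merges chrX, chrY, and chrM into one, for 23 partitions in total. A small sketch of that grouping follows. The `chr`-prefixed contig names, the partition numbering, and the function name `chromosome_partition` are assumptions for illustration; the BED files themselves define the real assignment.

```python
# Contig naming with a "chr" prefix is an assumption.
AUTOSOMES = [f"chr{i}" for i in range(1, 23)]
MERGED = {"chrX", "chrY", "chrM"}


def chromosome_partition(contig):
    """Return a partition index for a primary contig, or None otherwise.

    Autosomes map to 0..21 in numeric order; chrX, chrY, and chrM share
    index 22. The numbering itself is hypothetical.
    """
    if contig in MERGED:
        return len(AUTOSOMES)
    try:
        return AUTOSOMES.index(contig)
    except ValueError:
        return None  # alt, decoy, or unplaced contigs are not covered
```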
| Description | URI |
|---|---|
| HG38 reference genome | https://seqslabbundles.blob.core.windows.net/static/reference/38/BROAD-PUB-REF/Homo_sapiens_assembly38.dict |
| HG38 reference genome with primary contigs only | https://seqslabbundles.blob.core.windows.net/static/reference/38/PRIMARY/Homo_sapiens_assembly38.dict |
| HG19 reference genome | https://seqslabbundles.blob.core.windows.net/static/reference/19/HG/ref.dict |
| HG19 reference genome with primary contigs only | https://seqslabbundles.blob.core.windows.net/static/reference/19/HG-primary/ref.dict |
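
The `.dict` files listed above are sequence dictionaries, which follow the SAM header layout: one `@SQ` line per contig, with the name in the `SN` tag and the length in the `LN` tag. Assuming that layout, contig lengths can be read with a few lines of Python. The function name `parse_sequence_dict` and the two-contig sample text are illustrative only and are not copied from the hosted files.

```python
def parse_sequence_dict(text):
    """Map contig name to length from the @SQ lines of a sequence dictionary."""
    lengths = {}
    for line in text.splitlines():
        if not line.startswith("@SQ"):
            continue  # skip @HD and any other header records
        # Each field after the record type is TAG:VALUE; split on the first
        # colon only, because values such as UR may contain colons themselves.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths


# Toy input for illustration; the UR path is hypothetical.
sample = (
    "@HD\tVN:1.6\n"
    "@SQ\tSN:chr1\tLN:248956422\tUR:file:/tmp/ref.fasta\n"
    "@SQ\tSN:chrM\tLN:16569\n"
)
contig_lengths = parse_sequence_dict(sample)
```

The resulting mapping is the kind of input needed to validate that BED regions stay within contig bounds.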