Create a SeqsLab Run Sheet for sequencing experiments#

The SeqsLab Run Sheet is a CSV file that extends the Sample Sheet, a file format used by sequencer providers to store biological sample information and the metadata associated with a given experiment.

Objective#

This tutorial will help you create a SeqsLab Run Sheet.

Prerequisites#

Before you begin, you will need the following:

Sample Sheet example#

[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
21120276,,,,A701,ATCACGAC,A501,AAGGTTCA,,WGS
21120287,,,,A701,ATCACGAC,A501,AAGGTTCA,,WES
21070477,,,,A701,ATCACGAC,A501,AAGGTTCA,,RNASeq
21120248,,,,A701,ATCACGAC,A501,AAGGTTCA,,RNASeq
21120275,,,,A701,ATCACGAC,A501,AAGGTTCA,,RNASeq
21120249-t,,,,A701,ATCACGAC,A501,AAGGTTCA,,somatic
21120249-n,,,,A701,ATCACGAC,A501,AAGGTTCA,,somatic

How the Run Sheet is different#

Unlike the Sample Sheet, the Run Sheet defines six additional columns for each row of data: DRS_ID, Read1_Label, Read2_Label, Run_Name, Workflow_URL, and Runtimes. The Run Sheet serves as a dry lab plan that maps sequencing sample files to DRS objects, TRS workflows, and WES executions for all samples submitted in a sequencer run. It is also a critical input for the SeqsLab CLI in the daily dry lab routine, linking every step of the process: sample FASTQ upload, DRS registration, mapping of DRS objects to TRS workflows, and finally WES execution.

To create your own Run Sheet, modify the Sample Sheet template to include the following fields:

| Column | Description |
| --- | --- |
| DRS_ID | Associates physical sample files to the data-virtualized DRS object by assigning a DRS_ID rule using the existing Sample Sheet metadata. |
| Read1_Label | Associates a read1 DRS object to a specific TRS workflow by assigning a WDL FQN as a DRS object label to a read 1 DRS object, e.g., WGS.read/1. |
| Read2_Label | Associates a read2 DRS object to a specific TRS workflow by assigning a WDL FQN as a DRS object label to a read 2 DRS object, e.g., WGS.read/2. |
| Run_Name | Associates DRS objects to a specific WES run by assigning a unique and indicative run name, generally based on the Sample Sheet information, e.g., 2022-02-23_WGS_NA12878. The Run_Name is also used as the Base_Label for the DRS object. |
| Workflow_URL | TRS workflow_url, which specifies the TRS to be used for the WES run. |
| Runtimes | Specifies the WES execution runtimes configuration in the format of key-value pairs of WDL call-name and WES runtime options, e.g., WGS_main_workflow=SeqsLab.Accelerate.GCH1:BWA_mapping_workflow=SeqsLab.Accelerate.GCS1. The default value is an empty string. |
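As a rough illustration of the column list above, the following Python sketch checks whether the header row of a [Data] section already contains the six Run Sheet columns. The function name `missing_run_sheet_columns` and the shortened sample data are hypothetical and are not part of the SeqsLab CLI.

```python
import csv
import io

# The six columns the Run Sheet adds to the Sample Sheet [Data] section.
RUN_SHEET_COLUMNS = [
    "DRS_ID", "Read1_Label", "Read2_Label",
    "Run_Name", "Workflow_URL", "Runtimes",
]


def missing_run_sheet_columns(data_section: str) -> list[str]:
    """Return the Run Sheet columns absent from a [Data] header row.

    `data_section` is the CSV text that follows the [Data] marker.
    """
    reader = csv.reader(io.StringIO(data_section))
    header = next(reader, [])
    return [col for col in RUN_SHEET_COLUMNS if col not in header]


# Hypothetical, shortened Sample Sheet data with no Run Sheet columns yet.
sample_sheet_data = (
    "Sample_ID,index,index2,Description\n"
    "21120276,ATCACGAC,AAGGTTCA,WGS\n"
)
print(missing_run_sheet_columns(sample_sheet_data))
```

A plain Sample Sheet reports all six columns as missing; once the header has been extended, the function returns an empty list.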

DRS_ID rule and supported Sample Sheet metadata list#

The SeqsLab CLI uses the sample-sheet package to parse the Run Sheet, and generates the Sample Sheet metadata for each individual sequenced FASTQ file during the sample upload and registration process.

"metadata": {
    "dates": [
        {
            "date": "20230803",
            "type": {
                "value": "sequencing"
            }
        }
    ],
    "types": [
        {
            "method": {
                "value": "NextSeq FASTQ Only"
            },
            "platform": {
                "value": "Illumina"
            }
        }
    ],
    "privacy": "",
    "licenses": [],
    "contributors": [],
    "extra_properties": [
        {
            "values": "NA12878",
            "category": "Sample_ID"
        },
        {
            "values": "WGSPanCancerTest",
            "category": "Description"
        },
        {
            "values": "{$.extra_properties[?category=Date][values]}-{$.extra_properties[?category=Description][values]}-{$.extra_properties[?category=Sample_ID][values]}-{$.extra_properties[?category=Pair][values]}",
            "category": "DRS_ID"
        },
        {
            "values": "2023-08-03_WGS_01",
            "category": "Run_Name"
        },
        {
            "values": "WGSPanCancerTest/inputRead/1",
            "category": "Read1_Label"
        },
        {
            "values": "WGSPanCancerTest/inputRead/2",
            "category": "Read2_Label"
        },
        {
            "values": "https://api.seqslab.net/trs/v2/tools/WGSPanCancerTest/versions/0.1.0/WDL/files/",
            "category": "Workflow_URL"
        },
        {
            "values": "phenopacket_9k55CH8eowf6PLA",
            "category": "phenopacketID"
        },
        {
            "values": "biosample_34sjeicjekqk3ji4",
            "category": "BiosampleID"
        },
        {
            "values": "https://api.seqslab.net/trs/v2/tools/WGSPanCancerTest/versions/0.1.0/",
            "category": "DiseaseID"
        },
        {
            "values": "1",
            "category": "Order_Overall"
        },
        {
            "values": "1",
            "category": "Pair"
        },
        {
            "values": "5",
            "category": "IEMFileVersion"
        },
        {
            "values": "2023/08/03",
            "category": "Date"
        }
    ],
    "primary_publication": [],
    "alternate_identifiers": []
}

A DRS_ID rule is written as a series of JSONPath expressions chained with a hyphen (-) as the separator character. For example, the rule {$.extra_properties[?category=Date][values]}-{$.extra_properties[?category=Description][values]}-{$.extra_properties[?category=Sample_ID][values]}-{$.extra_properties[?category=Pair][values]} renders the DRS_ID as 20230803-WGSPanCancerTest-NA12878-1 for the sample FASTQ file NA12878_r1.fastq.gz.
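To make the substitution concrete, here is a minimal Python sketch of how such a rule could be rendered. It is not the SeqsLab CLI implementation: the regular expression only recognizes the placeholder shape shown above rather than general JSONPath, `render_rule` is a hypothetical helper, and the sample metadata assumes the Date value is already formatted as YYYYMMDD.

```python
import re

# Matches one placeholder such as {$.extra_properties[?category=Date][values]}
# and captures the category name. This covers only the placeholder shape used
# in this tutorial; it is not a general JSONPath parser.
PLACEHOLDER = re.compile(
    r"\{\$\.extra_properties\[\?category=(\w+)\]\[values\]\}"
)


def render_rule(rule: str, extra_properties: list[dict]) -> str:
    """Substitute each placeholder with the matching extra_properties value."""
    lookup = {p["category"]: p["values"] for p in extra_properties}
    # A category missing from the metadata raises KeyError.
    return PLACEHOLDER.sub(lambda m: lookup[m.group(1)], rule)


# Hypothetical metadata; Date is assumed to be pre-formatted as YYYYMMDD.
props = [
    {"category": "Date", "values": "20230803"},
    {"category": "Description", "values": "WGSPanCancerTest"},
    {"category": "Sample_ID", "values": "NA12878"},
    {"category": "Pair", "values": "1"},
]
rule = (
    "{$.extra_properties[?category=Date][values]}-"
    "{$.extra_properties[?category=Description][values]}-"
    "{$.extra_properties[?category=Sample_ID][values]}-"
    "{$.extra_properties[?category=Pair][values]}"
)
print(render_rule(rule, props))  # 20230803-WGSPanCancerTest-NA12878-1
```

Text outside the placeholders, such as the hyphen separators, passes through unchanged.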

Run_Name#

Specify a unique name for each WES run. Atgenomix recommends creating a Run_Name based on the Sample Sheet metadata to make it both unique and meaningful.

For a multi-sample WES run, we recommend using the template {$.extra_properties[?category=Date][values]}-{$.extra_properties[?category=Description][values]} for the Run_Name. For example, you can use 20230803-WGSPanCancerTest for a WES run that is shared by multiple WGS samples in a sequencing run from 20230803.

For a single-sample WES run, we recommend using the template {$.extra_properties[?category=Date][values]}-{$.extra_properties[?category=Description][values]}-{$.extra_properties[?category=Sample_ID][values]} for a sample-specific Run_Name. For example, you can use 20230803-WGSPanCancerTest-NA12878 for the WGS sample NA12878 in a sequencing run from 20230803.

DRS Labels#

The SeqsLab DRS service supports labeling, and the Run Sheet uses the labels {Run_Name}/{Read1_Label} and {Run_Name}/{Read2_Label} to associate the DRS object, the TRS object, and the WES run. To establish the relationship between DRS and WES, the Run Sheet uses Run_Name as the root label, which ties all DRS objects under {Run_Name} to the corresponding WES run.

To establish the relationship between DRS and TRS, the Run Sheet uses Read1_Label and Read2_Label to associate the sequencing sample's FASTQ read 1 and read 2 files with a WDL FQN of a TRS object. For example, for a TRS object wrapping a WGS GATK4 SNP/INDEL workflow, the WDL defines the input FASTQ files as WGS_HaplotypeCallerGvcf_GATK4.fastq_files.

{
  "WGS_HaplotypeCallerGvcf_GATK4.fastq_files": [
    "NA12878_r1.fq.gz",
    "NA12878_r2.fq.gz"
  ],
  ...
}

By assigning the Read1_Label and Read2_Label as follows, we can associate the DRS objects of the NA12878_r1.fq.gz and NA12878_r2.fq.gz files with the corresponding WDL FQN, WGS_HaplotypeCallerGvcf_GATK4.fastq_files. SeqsLab DRS labels support a directory-like, hierarchical query, so the FQN separator . is replaced with / in the Read1_Label and Read2_Label to make the data easier to access later.

[Data]
Sample_ID,Description,DRS_ID,Run_Name,Read1_Label,Read2_Label,Workflow_URL,Runtimes
NA12878,WGS,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2022-02-23_WGS_NA12878,WGS_HaplotypeCallerGvcf_GATK4/fastq_files/1,WGS_HaplotypeCallerGvcf_GATK4/fastq_files/2,https://api.seqslab.net/trs/v2/tools/trs_WGS/versions/1.0/WDL/files/,
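The label convention described above can be sketched in a few lines of Python. The helper names `read_labels` and `drs_label` are hypothetical and are used here only to show how a WDL FQN turns into the two read labels and the full {Run_Name}/{Read_Label} DRS label.

```python
def read_labels(wdl_fqn: str) -> tuple[str, str]:
    """Turn a WDL FQN into the (Read1_Label, Read2_Label) pair.

    The FQN separator "." becomes "/", and the read number is appended
    as the last path segment.
    """
    base = wdl_fqn.replace(".", "/")
    return f"{base}/1", f"{base}/2"


def drs_label(run_name: str, read_label: str) -> str:
    """Build the full DRS label, with Run_Name as the root label."""
    return f"{run_name}/{read_label}"


read1, read2 = read_labels("WGS_HaplotypeCallerGvcf_GATK4.fastq_files")
print(read1)  # WGS_HaplotypeCallerGvcf_GATK4/fastq_files/1
print(read2)  # WGS_HaplotypeCallerGvcf_GATK4/fastq_files/2
print(drs_label("2022-02-23_WGS_NA12878", read1))
```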

This same mechanism can be extended to a multiple-sample WDL by adding one more layer of labeling to match the two-dimensional array of the WDL FQN, as in the following example.

{
  "WGS_HaplotypeCallerGvcf_GATK4.fastq_files": [
    [
        "NA12878_r1.fq.gz",
        "NA12878_r2.fq.gz"
    ],
    [
        "NA12879_r1.fq.gz",
        "NA12879_r2.fq.gz"
    ]
  ],
  ...
}

[Data]
Sample_ID,Description,DRS_ID,Run_Name,Read1_Label,Read2_Label,Workflow_URL,Runtimes
NA12878,WGS,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2022-02-23_WGS,WGS_HaplotypeCallerGvcf_GATK4/fastq_files/1/1,WGS_HaplotypeCallerGvcf_GATK4/fastq_files/1/2,https://api.seqslab.net/trs/v2/tools/trs_WGS/versions/1.0/WDL/files/,
NA12879,WGS,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2022-02-23_WGS,WGS_HaplotypeCallerGvcf_GATK4/fastq_files/2/1,WGS_HaplotypeCallerGvcf_GATK4/fastq_files/2/2,https://api.seqslab.net/trs/v2/tools/trs_WGS/versions/1.0/WDL/files/,
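To show how the extra label layer lines up with the two-dimensional WDL input, the following Python sketch rebuilds the nested array from labelled files. It assumes labels of the form FQN path, then sample index, then read index, both starting at 1, as in the example above. The function `build_inputs` and the label-to-file mapping are hypothetical, not how SeqsLab resolves inputs internally.

```python
def build_inputs(labelled_files: dict[str, str], fqn: str) -> list[list[str]]:
    """Arrange files into a two-dimensional array using their labels.

    Labels are expected to look like
    <fqn with "." replaced by "/">/<sample index>/<read index>.
    """
    prefix = fqn.replace(".", "/") + "/"
    by_sample: dict[int, dict[int, str]] = {}
    for label, filename in labelled_files.items():
        if not label.startswith(prefix):
            continue  # label belongs to a different WDL input
        sample_idx, read_idx = (int(p) for p in label[len(prefix):].split("/"))
        by_sample.setdefault(sample_idx, {})[read_idx] = filename
    # Order the outer list by sample index and each inner list by read index.
    return [
        [reads[r] for r in sorted(reads)]
        for _, reads in sorted(by_sample.items())
    ]


fqn = "WGS_HaplotypeCallerGvcf_GATK4.fastq_files"
labelled = {
    "WGS_HaplotypeCallerGvcf_GATK4/fastq_files/2/2": "NA12879_r2.fq.gz",
    "WGS_HaplotypeCallerGvcf_GATK4/fastq_files/1/1": "NA12878_r1.fq.gz",
    "WGS_HaplotypeCallerGvcf_GATK4/fastq_files/2/1": "NA12879_r1.fq.gz",
    "WGS_HaplotypeCallerGvcf_GATK4/fastq_files/1/2": "NA12878_r2.fq.gz",
}
print(build_inputs(labelled, fqn))
```

The result is ordered by the label indexes, so it does not depend on the order in which the files are listed.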

Workflow_URL#

Specify the TRS object to apply to the given sample. The SeqsLab CLI tools list command returns the URL of each TRS tool version. To obtain the workflow_url, append the string {descriptor_type}/files/ (for example, WDL/files/) to the TRS tool version URL.

seqslab tools list | grep \"url\"
    "url": "https://api.seqslab.net/trs/v2/tools/trs_wgs/versions/1.0/"
    "url": "https://api.seqslab.net/trs/v2/tools/trs_wgs/versions/2.0/"
    "url": "https://api.seqslab.net/trs/v2/tools/trs_wgs/versions/3.0/"
...
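The URL construction described above amounts to a simple string append, sketched here in Python. The helper name `workflow_url` is hypothetical, and the default descriptor type of WDL is an assumption based on the examples in this tutorial.

```python
def workflow_url(tool_version_url: str, descriptor_type: str = "WDL") -> str:
    """Append {descriptor_type}/files/ to a TRS tool version URL."""
    # Normalize the trailing slash so the result never contains "//".
    return f"{tool_version_url.rstrip('/')}/{descriptor_type}/files/"


print(workflow_url("https://api.seqslab.net/trs/v2/tools/trs_wgs/versions/1.0/"))
# https://api.seqslab.net/trs/v2/tools/trs_wgs/versions/1.0/WDL/files/
```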

Runtimes#

Specify the runtime computation configuration for the WES run. This parameter is a colon-separated list of key-value pairs, where each key is a call name in the workflow object model (WOM) graph of the TRS object and each value is a cluster specification. The field can be left blank, in which case the main workflow of the TRS object is executed on the SeqsLab default cluster, acu-m8.
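For scripts that need to inspect a Runtimes value, a small parser along these lines may help. This is a sketch based only on the format described above. `parse_runtimes` is a hypothetical helper, and rejecting malformed entries with a ValueError is a design choice of this example rather than documented CLI behavior.

```python
def parse_runtimes(runtimes: str) -> dict[str, str]:
    """Split 'call=cluster:call=cluster' into a {call_name: cluster} dict.

    An empty string yields an empty dict, meaning the default cluster applies.
    """
    pairs = {}
    for item in filter(None, runtimes.split(":")):
        call_name, sep, cluster = item.partition("=")
        if not sep or not call_name or not cluster:
            raise ValueError(f"malformed Runtimes entry: {item!r}")
        pairs[call_name] = cluster
    return pairs


print(parse_runtimes("WES=acu-m8:bamPartition=acu-m16"))
# {'WES': 'acu-m8', 'bamPartition': 'acu-m16'}
print(parse_runtimes(""))
# {}
```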

Customization fields#

Apart from the six additional columns, the Run Sheet can take extra customized columns to support other dry lab integrations, as long as the extended Run Sheet remains in the CSV file format. For example, you can add a Download_FQNs column to specify a list of WDL output FQNs. Custom scripts that read the Run Sheet can then parse Download_FQNs for each sample and download those WDL outputs by using the datahub download command.

As another example, GA4GH Phenopacket information, such as PhenopacketID, BiosampleID, and DiseaseID, can also be added as customization fields to associate phenotypic or clinical information with the DRS object and the WES run.

| Column | Description |
| --- | --- |
| PhenopacketID | Associates DRS objects with a specific GA4GH Phenopacket object that holds phenotypic information. |
| BiosampleID | Associates DRS objects with a specific GA4GH Phenopacket biosample object that holds the biological sample information. |
| DiseaseID | Associates DRS objects with a specific disease or test, usually as an ontology ID. |

Run Sheet example#

[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description,DRS_ID,Run_Name,Read1_Label,Read2_Label,Workflow_URL,Runtimes,PhenopacketID,BiosampleID,DiseaseID
21120276,,,,A701,ATCACGAC,A501,AAGGTTCA,,WGS,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_WGS_21120276,WGS/inputRead/1,WGS/inputRead/2,https://api.seqslab.net/trs/v2/tools/trs_wgs/versions/1.0/WDL/files/,,PH_Me71Y8tCewj2Z,BS_Me71Y8tCewj2Z,DIS_493LaFKEkOf8I
21120287,,,,A701,ATCACGAC,A501,AAGGTTCA,,WES,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_WES_21120287,WES/inputRead/1,WES/inputRead/2,https://api.seqslab.net/trs/v2/tools/trs_wes/versions/1.0/WDL/files/,WES=acu-m8:bamPartition=acu-m16,PH_7kQK62zwrxPkc,BS_7kQK62zwrxPkc,DIS_493LaFKEkOf8I
21070477,,,,A701,ATCACGAC,A501,AAGGTTCA,,RNASeq,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_RNASeq,RNASeq/inputRead/1/1,RNASeq/inputRead/1/2,https://api.seqslab.net/trs/v2/tools/trs_rnaseq/versions/1.0/WDL/files/,,PH_NvIhjULtvZol5,BS_NvIhjULtvZol5,DIS_FI9n2VwXWRzBd
21120248,,,,A701,ATCACGAC,A501,AAGGTTCA,,RNASeq,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_RNASeq,RNASeq/inputRead/2/1,RNASeq/inputRead/2/2,https://api.seqslab.net/trs/v2/tools/trs_rnaseq/versions/1.0/WDL/files/,,PH_Q9tkQyWcGZ30x,BS_Q9tkQyWcGZ30x,DIS_FI9n2VwXWRzBd
21120275,,,,A701,ATCACGAC,A501,AAGGTTCA,,RNASeq,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_RNASeq,RNASeq/inputRead/3/1,RNASeq/inputRead/3/2,https://api.seqslab.net/trs/v2/tools/trs_rnaseq/versions/1.0/WDL/files/,,PH_nNiKxmXxX2rYU,BS_nNiKxmXxX2rYU,DIS_FI9n2VwXWRzBd
21120249-t,,,,A701,ATCACGAC,A501,AAGGTTCA,,somatic,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_somatic,somatic/inputReadTumor/1,somatic/inputReadTumor/2,https://api.seqslab.net/trs/v2/tools/trs_somatic/versions/1.0/WDL/files/,somatic=acu-m64:Calling=acu-m8,PH_fkhYMoRRyT05T,BS_fkhYMoRRyT05T,DIS_V7wjmISIvC7xD
21120249-n,,,,A701,ATCACGAC,A501,AAGGTTCA,,somatic,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_somatic,somatic/inputReadNormal/1,somatic/inputReadNormal/2,https://api.seqslab.net/trs/v2/tools/trs_somatic/versions/1.0/WDL/files/,somatic=acu-m64:Calling=acu-m8,PH_im731GhGK86n5,BS_im731GhGK86n5,DIS_V7wjmISIvC7xD
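Because samples that share a Run_Name are submitted to the same WES run, it can be useful to check the grouping before handing a Run Sheet to the CLI. The sketch below does this with only the Python standard library. It does not use the sample-sheet package, `samples_by_run` is a hypothetical helper, and the sheet text is a shortened, assumed excerpt of the example above.

```python
import csv
import io
from collections import defaultdict


def samples_by_run(run_sheet_text: str) -> dict[str, list[str]]:
    """Group Sample_ID values by Run_Name from the [Data] section."""
    lines = run_sheet_text.splitlines()
    # Locate the [Data] marker; everything after it is treated as CSV.
    start = next(
        i for i, line in enumerate(lines) if line.strip().startswith("[Data]")
    )
    reader = csv.DictReader(io.StringIO("\n".join(lines[start + 1:])))
    groups = defaultdict(list)
    for row in reader:
        groups[row["Run_Name"]].append(row["Sample_ID"])
    return dict(groups)


# Shortened, hypothetical Run Sheet with only the columns needed here.
run_sheet = """[Data]
Sample_ID,Description,Run_Name
21120276,WGS,2021-07-29_WGS_21120276
21070477,RNASeq,2021-07-29_RNASeq
21120248,RNASeq,2021-07-29_RNASeq
"""
for run_name, sample_ids in samples_by_run(run_sheet).items():
    print(run_name, sample_ids)
```

In this excerpt the WGS sample gets its own run, while the two RNASeq samples share the multi-sample run 2021-07-29_RNASeq.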