Creating a SeqsLab Run Sheet for sequencing experiments

The SeqsLab Run Sheet is a CSV file that directly extends the Sample Sheet, a file format used by sequencer providers to store biological sample information and metadata associated with a given experiment.

Sample Sheet example

[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
21120276,,,,A701,ATCACGAC,A501,AAGGTTCA,,WGS
21120287,,,,A701,ATCACGAC,A501,AAGGTTCA,,WES
21070477,,,,A701,ATCACGAC,A501,AAGGTTCA,,RNASeq
21120248,,,,A701,ATCACGAC,A501,AAGGTTCA,,RNASeq
21120275,,,,A701,ATCACGAC,A501,AAGGTTCA,,RNASeq
21120249-t,,,,A701,ATCACGAC,A501,AAGGTTCA,,somatic
21120249-n,,,,A701,ATCACGAC,A501,AAGGTTCA,,somatic

How the Run Sheet is different

Unlike the Sample Sheet, the Run Sheet defines six additional columns for each row of data: DRS_ID, Read1_Tag, Read2_Tag, Run_Name, Workflow_URL, and Runtimes. The Run Sheet serves as a dry lab overview plan, specifying the mapping among sequencing sample files, DRS objects, TRS workflows, and WES executions for all samples submitted in a sequencer run. The Run Sheet is also a critical input for the SeqsLab CLI in the dry lab daily routine, linking every step of the process: sample FASTQ upload, DRS registration, mapping of DRS objects to TRS workflows, and finally WES execution.

Column: Description

DRS_ID: Associates physical sample files to the data-virtualized DRS object by assigning a DRS_ID rule using the existing Sample Sheet metadata.

Read1_Tag: Associates a read1 DRS object to a specific TRS workflow by assigning a WDL FQN as a DRS object tag to a read 1 DRS object, e.g., WGS.read/1.

Read2_Tag: Associates a read2 DRS object to a specific TRS workflow by assigning a WDL FQN as a DRS object tag to a read 2 DRS object, e.g., WGS.read/2.

Run_Name: Associates DRS objects to a specific WES run by assigning a unique and indicative run name, generally based on the Sample Sheet information, e.g., 2022-02-23_WGS_NA12878. The Run_Name is also used as the Base_Tag for the DRS object.

Workflow_URL: TRS workflow_url, which specifies the TRS to be used for the WES run.

Runtimes: Specifies the WES execution runtimes configuration in the format of key-value pairs of WDL call-name and WES runtime options, e.g., WGS_main_workflow=SeqsLab.Accelerate.GCH1:BWA_mapping_workflow=SeqsLab.Accelerate.GCS1. The default value is an empty string.

DRS_ID rule and supported Sample Sheet metadata list

The SeqsLab CLI uses the sample-sheet package to parse the Run Sheet, and generates the Sample Sheet metadata for each individual sequenced FASTQ file during the sample upload and registration process. By default, the Sample Sheet metadata is categorized into header, sample, and file, as shown in the following example.

"metadata": {
    "header": {
        "IEMFileVersion": "5",
        "Date": "2022_02_24",
        "Workflow": "FASTQ",
        "Application": "NextSeq FASTQ Only",
        "Instrument_Type": "NextSeq",
        "Assay": "QIASeq FX and cfDNA",
        "Index_Adapters": "QIASeq FX and cfDNA (Plate)",
        "Description": "",
        "Chemistry": "Amplicon"
    },
    "sample": {
        "Sample_ID": "NA12878",
        "Sample_Name": "",
        "Sample_Plate": "",
        "Sample_Well": "",
        "Index_Plate_Well": "C10",
        "I7_Index_ID": "N000",
        "index": "TGACCAGC",
        "I5_Index_ID": "S000",
        "index2": "TCTTCCAT",
        "Sample_Project": "",
        "Description": "WGS"
    },
    "file": {
        "Pair": "1"
    }
}

A DRS_ID rule references the Sample Sheet metadata with a nested dictionary syntax, {category.key}, and chains the placeholders with a hyphen (-) as the separator character. For example, the rule {header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair} renders the DRS_ID as 20220224-WGS-NA12878-1 for the sample FASTQ file NA12878_r1.fastq.gz.
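The placeholder substitution described above can be sketched in a few lines of Python. This is an illustration only, not the SeqsLab CLI implementation: render_rule is a hypothetical helper, and it performs plain substitution, so the date appears exactly as stored in the header (2022_02_24 in the metadata example) and any date normalization the CLI may apply is not modeled.

```python
import re


def render_rule(rule, metadata):
    """Replace each {category.key} placeholder with metadata[category][key].

    Hypothetical helper for illustration; not part of the SeqsLab CLI.
    """
    def lookup(match):
        category, key = match.group(1).split(".", 1)
        return str(metadata[category][key])

    return re.sub(r"\{([^{}]+)\}", lookup, rule)


# Trimmed copy of the metadata example, keeping only the referenced keys.
metadata = {
    "header": {"Date": "2022_02_24"},
    "sample": {"Sample_ID": "NA12878", "Description": "WGS"},
    "file": {"Pair": "1"},
}

rule = "{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair}"
drs_id = render_rule(rule, metadata)
print(drs_id)  # 2022_02_24-WGS-NA12878-1
```

A rule that references a key missing from the metadata raises a KeyError in this sketch, which makes typos in the rule visible early.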

Run_Name

Specify a unique name for a given WES run. Atgenomix recommends creating a Run_Name based on the Sample Sheet metadata to make it both unique and meaningful. For a multi-sample WES run, we recommend using the template {header.Date}_{sample.Description} for the Run_Name. For example, 2022-02-23_WGS is shared by multiple WGS samples in a sequencing run from 2022-02-23. For a single-sample WES run, we recommend using the template {header.Date}_{sample.Description}_{sample.Sample_ID} for a sample-specific Run_Name. For example, 2022-02-23_WGS_NA12878 identifies the specific WGS sample NA12878 in a sequencing run from 2022-02-23.
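To see how the choice of Run_Name decides which samples end up in the same WES run, the following sketch groups Run Sheet rows by Run_Name. The helper group_by_run is hypothetical, and the rows are a hand-copied subset of the Run Sheet example at the end of this page.

```python
from collections import defaultdict


def group_by_run(rows):
    """Map each Run_Name to the list of Sample_IDs that share it."""
    runs = defaultdict(list)
    for row in rows:
        runs[row["Run_Name"]].append(row["Sample_ID"])
    return dict(runs)


rows = [
    {"Sample_ID": "21120276", "Run_Name": "2021-07-29_WGS_21120276"},
    {"Sample_ID": "21070477", "Run_Name": "2021-07-29_RNASeq"},
    {"Sample_ID": "21120248", "Run_Name": "2021-07-29_RNASeq"},
]

runs = group_by_run(rows)
for run_name, sample_ids in runs.items():
    print(run_name, sample_ids)
```

The sample-specific Run_Name yields a run with a single sample, while the shared Run_Name collects both RNASeq samples into one run.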

DRS Tags

The SeqsLab DRS service supports tagging, and the Run Sheet uses the tags {Run_Name}/{Read1_Tag} and {Run_Name}/{Read2_Tag} to associate the DRS object, TRS object, and WES run. To establish the relationship between DRS and WES, the Run Sheet uses the Run_Name as the root tag of all associated DRS objects and as the name of the corresponding WES run.

To establish the relationship between DRS and TRS, the Run Sheet uses Read1_Tag and Read2_Tag to associate the corresponding sequencing sample FASTQ read 1 and read 2 files with a WDL FQN of a TRS object. For example, for a TRS object wrapping a WGS GATK4 SNP/INDEL workflow, the WDL defines the input FASTQ files as WGS_HaplotypeCallerGvcf_GATK4.fastq_files, as seen in the following example:

{
  "WGS_HaplotypeCallerGvcf_GATK4.fastq_files": [
    "NA12878_r1.fq.gz",
    "NA12878_r2.fq.gz"
  ],
  ...
}

By assigning Read1_Tag and Read2_Tag as follows, we can associate the DRS objects of the NA12878_r1.fq.gz and NA12878_r2.fq.gz files with the corresponding WDL FQN, WGS_HaplotypeCallerGvcf_GATK4.fastq_files. The SeqsLab DRS tags support a directory-like, hierarchical query, so the FQN separator . is replaced with / in the Read1_Tag and Read2_Tag to enhance future data accessibility.

[Data]
Sample_ID,Description,DRS_ID,Run_Name,Read1_Tag,Read2_Tag,Workflow_URL,Runtimes
NA12878,WGS,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2022-02-23_WGS_NA12878,WGS_HaplotypeCallerGvcf_GATK4/fastq_files/1,WGS_HaplotypeCallerGvcf_GATK4/fastq_files/2,https://api.seqslab.net/trs/v2/tools/trs_WGS/versions/1.0/WDL/files/,

This same mechanism can be extended to a multiple-sample WDL by adding one more layer of tagging to match the two-dimensional array of the WDL FQN, as in the following example.

{
  "WGS_HaplotypeCallerGvcf_GATK4.fastq_files": [
    [
        "NA12878_r1.fq.gz",
        "NA12878_r2.fq.gz"
    ],
    [
        "NA12879_r1.fq.gz",
        "NA12879_r2.fq.gz"
    ]
  ],
  ...
}

[Data]
Sample_ID,Description,DRS_ID,Run_Name,Read1_Tag,Read2_Tag,Workflow_URL,Runtimes
NA12878,WGS,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2022-02-23_WGS,WGS_HaplotypeCallerGvcf_GATK4/fastq_files/1/1,WGS_HaplotypeCallerGvcf_GATK4/fastq_files/1/2,https://api.seqslab.net/trs/v2/tools/trs_WGS/versions/1.0/WDL/files/,
NA12879,WGS,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2022-02-23_WGS,WGS_HaplotypeCallerGvcf_GATK4/fastq_files/2/1,WGS_HaplotypeCallerGvcf_GATK4/fastq_files/2/2,https://api.seqslab.net/trs/v2/tools/trs_WGS/versions/1.0/WDL/files/,
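The tag values in the single-sample and multiple-sample examples follow one pattern: replace the . in the WDL FQN with /, optionally append the 1-based sample position, then append the read number. The sketch below reproduces that pattern. Both read_tags and drs_tags are hypothetical helpers written for this illustration; they are not SeqsLab CLI functions.

```python
def read_tags(wdl_fqn, sample_index=None):
    """Return (Read1_Tag, Read2_Tag) for a WDL input FQN.

    sample_index is the 1-based position of the sample in a
    two-dimensional input array; leave it as None for a single sample.
    """
    base = wdl_fqn.replace(".", "/")
    if sample_index is not None:
        base = f"{base}/{sample_index}"
    return f"{base}/1", f"{base}/2"


def drs_tags(run_name, wdl_fqn, sample_index=None):
    """Prefix both read tags with the Run_Name root tag."""
    return tuple(f"{run_name}/{tag}" for tag in read_tags(wdl_fqn, sample_index))


fqn = "WGS_HaplotypeCallerGvcf_GATK4.fastq_files"
single = read_tags(fqn)
second_sample = read_tags(fqn, sample_index=2)
full = drs_tags("2022-02-23_WGS", fqn, sample_index=2)

print(single[0])         # WGS_HaplotypeCallerGvcf_GATK4/fastq_files/1
print(second_sample[1])  # WGS_HaplotypeCallerGvcf_GATK4/fastq_files/2/2
print(full[0])           # 2022-02-23_WGS/WGS_HaplotypeCallerGvcf_GATK4/fastq_files/2/1
```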

Workflow_URL

Specify the TRS object to apply to the given sample. The SeqsLab CLI tools list command returns the URL of each TRS tool version, and the workflow_url is obtained by appending the string "{descriptor_type}/files/" to the TRS tool version URL, e.g., "WDL/files/".

seqslab tools list | grep \"url\"
    "url": "https://api.seqslab.net/trs/v2/tools/trs_wgs/versions/1.0/"
    "url": "https://api.seqslab.net/trs/v2/tools/trs_wgs/versions/2.0/"
    "url": "https://api.seqslab.net/trs/v2/tools/trs_wgs/versions/3.0/"
...
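The string manipulation described above is small enough to show directly. The function workflow_url below is a hypothetical helper, not a CLI command; it only appends the descriptor type and files/ to a tool version URL such as those printed by the listing above.

```python
def workflow_url(version_url, descriptor_type="WDL"):
    """Append "{descriptor_type}/files/" to a TRS tool version URL."""
    # Strip any trailing slash first so the result never contains "//".
    return f"{version_url.rstrip('/')}/{descriptor_type}/files/"


url = workflow_url("https://api.seqslab.net/trs/v2/tools/trs_wgs/versions/1.0/")
print(url)  # https://api.seqslab.net/trs/v2/tools/trs_wgs/versions/1.0/WDL/files/
```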

Runtimes

Specify the runtime computation configuration for the WES run. This parameter is a concatenation of colon-separated key-value pairs, in which each key is a call name in the workflow object model (WOM) graph of the TRS object and each value is the cluster specification for that call. This field can be left blank, in which case the main workflow of the TRS object is executed on the SeqsLab default cluster, acu-m8.
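A Runtimes value can be checked before submission by splitting it into its pairs. The following is a minimal sketch under the format described above (pairs joined by :, key and value joined by =); parse_runtimes is a hypothetical helper and does not verify that the call names or cluster names actually exist.

```python
def parse_runtimes(runtimes):
    """Split a Runtimes string into a {call_name: cluster} dictionary.

    An empty string returns an empty dictionary, meaning the defaults apply.
    """
    if not runtimes.strip():
        return {}
    pairs = {}
    for item in runtimes.split(":"):
        call_name, sep, cluster = item.partition("=")
        if not sep or not call_name.strip() or not cluster.strip():
            raise ValueError(f"malformed Runtimes entry: {item!r}")
        pairs[call_name.strip()] = cluster.strip()
    return pairs


parsed = parse_runtimes("WES=acu-m8:bamPartition=acu-m16")
print(parsed)  # {'WES': 'acu-m8', 'bamPartition': 'acu-m16'}
```

An entry without an = sign, such as a stray call name, raises a ValueError instead of being silently dropped.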

Customization fields

Apart from the six additional columns, the Run Sheet can take extra customized columns to support other dry lab integrations, as long as the extended Run Sheet remains a valid CSV file. For example, you can add a Download_FQNs column to specify a list of WDL output FQNs. A custom script can then read the Run Sheet, parse the Download_FQNs value for each sample, and download the corresponding WDL outputs by using the datahub download command.
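A custom script of this kind might start as follows. The sketch uses only the Python standard library rather than the sample-sheet package, and it makes several assumptions that are not defined by the Run Sheet format: the FQNs inside Download_FQNs are separated by semicolons, the [Data] section is the last section in the file, and the output FQNs WGS.outputVcf and WGS.outputBam are made-up names. The download step itself is left out.

```python
import csv


def read_data_rows(text):
    """Return the rows of the [Data] section as a list of dictionaries."""
    lines = text.splitlines()
    # Find the [Data] marker; the line after it is the column header.
    start = next(
        i for i, line in enumerate(lines) if line.strip().startswith("[Data]")
    ) + 1
    body = [line for line in lines[start:] if line.strip()]
    return list(csv.DictReader(body))


# Shortened Run Sheet with a custom Download_FQNs column (hypothetical values).
run_sheet = """[Header]
Date,2022_02_24
[Data]
Sample_ID,Description,Run_Name,Download_FQNs
NA12878,WGS,2022-02-23_WGS_NA12878,WGS.outputVcf;WGS.outputBam
"""

rows = read_data_rows(run_sheet)
# Assumed convention: semicolon-separated FQNs inside one CSV cell.
downloads = {
    row["Sample_ID"]: [fqn for fqn in row["Download_FQNs"].split(";") if fqn]
    for row in rows
}
print(downloads)  # {'NA12878': ['WGS.outputVcf', 'WGS.outputBam']}
```

A semicolon is used inside the cell because a comma would collide with the CSV field separator unless the cell were quoted.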

Run Sheet example

[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description,DRS_ID,Run_Name,Read1_Tag,Read2_Tag,Workflow_URL,Runtimes
21120276,,,,A701,ATCACGAC,A501,AAGGTTCA,,WGS,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_WGS_21120276,WGS/inputRead/1,WGS/inputRead/2,https://api.seqslab.net/trs/v2/tools/trs_wgs/versions/1.0/WDL/files/,
21120287,,,,A701,ATCACGAC,A501,AAGGTTCA,,WES,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_WES_21120287,WES/inputRead/1,WES/inputRead/2,https://api.seqslab.net/trs/v2/tools/trs_wes/versions/1.0/WDL/files/,WES=acu-m8:bamPartition=acu-m16
21070477,,,,A701,ATCACGAC,A501,AAGGTTCA,,RNASeq,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_RNASeq,RNASeq/inputRead/1/1,RNASeq/inputRead/1/2,https://api.seqslab.net/trs/v2/tools/trs_rnaseq/versions/1.0/WDL/files/,
21120248,,,,A701,ATCACGAC,A501,AAGGTTCA,,RNASeq,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_RNASeq,RNASeq/inputRead/2/1,RNASeq/inputRead/2/2,https://api.seqslab.net/trs/v2/tools/trs_rnaseq/versions/1.0/WDL/files/,
21120275,,,,A701,ATCACGAC,A501,AAGGTTCA,,RNASeq,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_RNASeq,RNASeq/inputRead/3/1,RNASeq/inputRead/3/2,https://api.seqslab.net/trs/v2/tools/trs_rnaseq/versions/1.0/WDL/files/,
21120249-t,,,,A701,ATCACGAC,A501,AAGGTTCA,,somatic,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_somatic,somatic/inputReadTumor/1,somatic/inputReadTumor/2,https://api.seqslab.net/trs/v2/tools/trs_somatic/versions/1.0/WDL/files/,somatic=acu-m64:Calling=acu-m8
21120249-n,,,,A701,ATCACGAC,A501,AAGGTTCA,,somatic,{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair},2021-07-29_somatic,somatic/inputReadNormal/1,somatic/inputReadNormal/2,https://api.seqslab.net/trs/v2/tools/trs_somatic/versions/1.0/WDL/files/,somatic=acu-m64:Calling=acu-m8