Using the SeqsLab Run Sheet with the CLI

This tutorial demonstrates how to use the SeqsLab Run Sheet and the SeqsLab CLI to automate your sequencing data processing flows. The Run Sheet contains all the mapping and configuration information about the data, workflows, and pipeline execution.

The SeqsLab CLI can take the Run Sheet as a parameter, which condenses the entire sequencing data processing flow, from ingesting the sequencer output to retrieving the analysis results, into just a few SeqsLab CLI commands. You can also eventually transform this flow into a fully automated process.

1. Uploading and registering sample files

As previously explained, you can use the SeqsLab CLI to upload either individual files or entire directories to the Data Hub with the datahub upload command. Alternatively, you can use the SeqsLab Run Sheet to upload sample FASTQ files by preparing the Run Sheet file and then running the datahub upload-runsheet command. Doing so writes an upload_response.json object to stdout. A return code of 0 indicates that all files in the src path were uploaded successfully; a non-zero return code means that some files failed to upload, typically due to a network issue. When this happens, simply run the command again to complete the upload process.

SeqsLab platforms deployed in the cloud with an Azure backend use the Azure Block List API (external link) whenever you run the SeqsLab CLI datahub upload command. This enables files to be programmatically broken up into blocks, uploaded in parallel, and reassembled in cloud storage as a block blob (external link). As a result, even if the datahub upload command is executed multiple times, all successfully uploaded blocks are kept in Azure cloud storage as a cache and only the failed blocks are retransmitted, resulting in highly efficient and fault-resilient data transmission.

seqslab datahub upload-runsheet \
    --run-sheet /home/run-2022-02-26.csv \
    --input-dir /volume/fastq/2022-02-14/ \
    --workspace seqslabwus2 > upload.json

Running the datahub upload-runsheet command produces an upload_response.json object for each uploaded sample file, as shown below. In addition to the automatically populated storage-related fields, the metadata fields are filled out based on the Sample Sheet information extracted from the Run Sheet.

{
    "name": "NA12878-R1_001_R1.fastq.gz",
    "mime_type": "application/gzip",
    "file_type": "fastq.gz",
    "size": 136614814,
    "created_time": "2022-03-03T06:08:23.405391",
    "access_methods": [
        {
            "type": "https",
            "access_url": {
                "url": "https://seqslabapi32b21storage.blob.core.windows.net/seqslab/drs/usr_gNGAlr1m0EYMbEx/seqslab/seqslab/mntcbdh/TestSample/2022_01_18_2/FASTQ/22010262-3M_S39_R1_001.fastq.gz",
                "headers": {
                    "Authorization": null
                }
            },
            "access_tier": "hot",
            "region": "westus2"
        }
    ],
    "checksums": [
        {
            "checksum": "73c643e2d4d473ab339af2360599086b23890a249e3bbc38ca8344606ba109d9",
            "type": "sha256"
        }
    ],
    "status": "complete",
    "description": null,
    "metadata": {
        "header": {
            "IEMFileVersion": "5",
            "Date": "2022_01_18_2",
            "Workflow": "GenerateFASTQ",
            "Application": "NextSeq FASTQ Only",
            "Instrument_Type": "NextSeq/MiniSeq",
            "Assay": "QIASeq FX and cfDNA",
            "Index_Adapters": "QIASeq FX and cfDNA (Plate)",
            "Description": "",
            "Chemistry": "Amplicon"
        },
        "sample": {
            "Sample_ID": "NA12878",
            "Sample_Name": "",
            "Sample_Plate": "",
            "Sample_Well": "",
            "Index_Plate_Well": "B02",
            "I7_Index_ID": "N702",
            "index": "CGTACTAG",
            "I5_Index_ID": "S503",
            "index2": "AGAGGATA",
            "Sample_Project": "",
            "Description": "WGS",
            "DRS_ID": "{header.Date}-{sample.Description}-{sample.Sample_ID}-{file.Pair}",
            "Run_Name": "2022-03-03_WGS_NA12878",
            "Read1_Tag": "wgs/inputRead",
            "Read2_Tag": "",
            "Workflow_URL": "https://api.seqslab.net/trs/v2/tools/trs_G3A9QuumbKxuSvl/versions/1.0/WDL/files/",
            "Runtimes": "",
            "Order_Overall": "5"
        },
        "file": {
            "Pair": "1"
        }
    },
    "tags": [
        "2022-03-03_WGS_NA12878/wgs/inputRead"
    ],
    "aliases": [],
    "id": "2022_01_18_2_PGS_22010262-3M_1"
}
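Before running the register step, you may want to confirm in a script that every entry reports a complete upload. The helper below is a hypothetical sketch, not part of the SeqsLab CLI; it assumes the saved upload.json parses to a list of upload_response objects shaped like the example above, and only the "complete" status value is documented here.

```python
def incomplete_uploads(responses):
    """Return the names of sample files whose upload did not finish."""
    return [r["name"] for r in responses if r.get("status") != "complete"]

# Trimmed upload_response objects; the R2 entry and its "in-progress"
# status are hypothetical, illustrating an upload that needs a retry.
responses = [
    {"name": "NA12878-R1_001_R1.fastq.gz", "status": "complete"},
    {"name": "NA12878-R1_001_R2.fastq.gz", "status": "in-progress"},
]

print(incomplete_uploads(responses))  # the R2 file still needs a retry
```

If the list is non-empty, rerunning datahub upload-runsheet retransmits only the failed blocks, as described above.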

After the sample files are uploaded, you can then use the datahub register command to complete the DRS registration process.

seqslab datahub register \
    file-blob \
    --workspace seqslabwus2 \
    --stdin < upload.json > register.json

2. Executing a job with WES

After the sample files are uploaded and registered as DRS objects, you can proceed to execute a job using the Workflow Execution Service (WES). The SeqsLab CLI provides a jobs request-runsheet command to create a run-request.json file for each WES run defined in a Run Sheet; if a Run Sheet defines multiple WES runs, the command generates multiple run-request.json files. When the run-request.json files are ready, you can use the jobs run command to launch all WES runs based on the run-request.json files in the working directory. The jobs run command then responds with a JSON file indicating the submitted run_id and run_name.

The following is an example:

mkdir /home/run-2022-02-26/

seqslab jobs request-runsheet \
    --working-dir /home/run-2022-02-26/ \
    --run-sheet /home/run-2022-02-26.csv

seqslab jobs run \
    --workspace seqslabwus2 \
    --working-dir /home/run-2022-02-26/ \
    --response-path result.json
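The response file can then feed the monitoring step. The exact shape of result.json is not shown in this tutorial, so the sketch below (a hypothetical helper) assumes it parses to a list of objects each carrying run_id and run_name; adjust it if your CLI version nests them differently.

```python
def submitted_run_ids(runs):
    """Collect the run_id of every submitted WES run.

    Assumes a list of {"run_id": ..., "run_name": ...} objects,
    which is an assumption about the result.json layout.
    """
    return [r["run_id"] for r in runs]

runs = [{"run_id": "run_DdtSfRfOr2AVTSe", "run_name": "2022_02_11_WGS_22010402"}]
print(submitted_run_ids(runs))  # ['run_DdtSfRfOr2AVTSe']
```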

3. Monitoring a job run

You can use the jobs run-state command to check the status of each run until it reaches the COMPLETE state.

seqslab jobs run-state --run-id run_DdtSfRfOr2AVTSe
{"run_id": "run_DdtSfRfOr2AVTSe", "state": "COMPLETE"}

If you want the full run information, you can use the jobs get command, which returns the detailed WES run information in JSON format. The response includes basic attributes such as the run_id, run_name, state, start_time, and end_time for run monitoring. It also includes a logs section containing detailed execution information for each WDL task, such as the rendered command, start_time, end_time, exit_code, storage_url, and outputs. Lastly, it includes an outputs section that maps each WDL main-workflow-level output FQN to its DRS self-URI and local file name.

seqslab jobs get --run-id run_DdtSfRfOr2AVTSe
{
    "id": "run_DdtSfRfOr2AVTSe",
    "name": "2022_02_11_WGS_22010402",
    "outputs": [
        {
            "fqn": "WGS.sampleMutect2Vcf",
            "cloud": [
                "drs://api.seqslab.net/drs_FxjCfOIBJ8mm89L"
            ],
            "local": [
                "22010402_Mutect2_tumor.vcf.gz"
            ]
        },
        ...
    ],
    "logs": [
        {
            "id": 1558,
            "name": "bwa-x-4643c-run-ddtsfrfor2avtse",
            "cmd": "set -e -o pipefail\n\n/home/tools/bwa-0.7.17/bwa \\\n    mem -M -t 14 ${refFa} \\\n    -R \"@RG\\tID:NextSeq550_${day}\\tSM:${sampleName}\\tPL:NextSeq\\tPI:550\" \\\n    ${inFileFastq} > \\\n    ${outPathSam} 2>> ${outPathLog}\n\n/home/tools/samtools-1.9/samtools \\\n    view -bS \\\n    ${outPathSam} \\\n    -o tmp.bam\n\n/home/tools/samtools-1.9/samtools \\\n    sort tmp.bam \\\n    -o ${outPathBam}",
            "start_time": "2022-02-11T10:41:43Z",
            "end_time": "2022-02-11T11:10:47Z",
            "stdout": "stdout",
            "stderr": "stderr",
            "activity": "../audit.log",
            "storage_url": "abfss://seqslab@seqslabapi32b21storage.dfs.core.windows.net/outputs/wes/run_DdtSfRfOr2AVTSe/WGS.NIPT.Bwa_x/",
            "exit_code": 0,
            "outputs": [
                {
                    "fqn": "WGS.NIPT.Bwa.outFileBam",
                    "cloud": [
                        "drs://api.seqslab.net/drs_DjKkaETD7x7gZBA"
                    ],
                    "local": [
                        "22010402.bam"
                    ]
                },
                {
                    "fqn": "WGS.NIPT.Bwa.outFileLog",
                    "cloud": [
                        "drs://api.seqslab.net/drs_9Er0LWEMDbbCokV"
                    ],
                    "local": [
                        "22010402_Bwa.log"
                    ]
                }
            ]
        },
        ...
    ],
    "state": "COMPLETE",
    "request": {
        "id": 283,
        "name": "2022_01_18_2_WGS_22010402",
        "description": null,
        "workflow_type": "WDL",
        "workflow_type_version": "1.0",
        "workflow_params": { ...
        },
        "workflow_backend_params": { ...
        },
        "workflow_url": "https://api.seqslab.net/trs/v2/tools/trs_wgs_snp_indel/versions/1.0/WDL/files/",
        "tags": []
    },
    "start_time": "2022-02-11T10:41:28Z",
    "end_time": "2022-02-11T13:47:20Z"
}
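The outputs section of this response maps each main-workflow output to its DRS URI, which is exactly what the retrieval step needs. A small sketch of extracting that mapping (the helper name is hypothetical):

```python
def output_uris(run_info):
    """Map each main-workflow output FQN to its DRS self-URIs.

    Expects the JSON object returned by `seqslab jobs get`.
    """
    return {o["fqn"]: o["cloud"] for o in run_info["outputs"]}

# Trimmed example based on the response above.
run_info = {
    "id": "run_DdtSfRfOr2AVTSe",
    "state": "COMPLETE",
    "outputs": [
        {
            "fqn": "WGS.sampleMutect2Vcf",
            "cloud": ["drs://api.seqslab.net/drs_FxjCfOIBJ8mm89L"],
            "local": ["22010402_Mutect2_tumor.vcf.gz"],
        }
    ],
}

print(output_uris(run_info))
```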

4. Retrieving results

Once the run reaches the COMPLETE state, you can retrieve the results using the datahub download command, which downloads the pipeline run output files to the local machine. The command accepts either multiple DRS self-URIs or multiple DRS IDs and downloads the corresponding files into a destination directory.

seqslab datahub download \
    --workspace seqslabwus2 \
    --dst ~/Downloads/ \
    --self-uri drs://api.seqslab.net/drs_ODlEMzEKxhxwc43 drs://api.seqslab.net/drs_Otr1u9pIYAe2JLr
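If you script the retrieval, the self-URIs collected from a run's outputs section can be assembled into the download invocation programmatically. A minimal sketch, with the destination path as a placeholder:

```python
import shlex

def download_command(self_uris, workspace, dst):
    """Assemble a `seqslab datahub download` invocation for a set of DRS self-URIs."""
    args = ["seqslab", "datahub", "download",
            "--workspace", workspace,
            "--dst", dst,
            "--self-uri", *self_uris]
    return " ".join(shlex.quote(a) for a in args)

cmd = download_command(
    ["drs://api.seqslab.net/drs_ODlEMzEKxhxwc43",
     "drs://api.seqslab.net/drs_Otr1u9pIYAe2JLr"],
    workspace="seqslabwus2",
    dst="./downloads/",  # placeholder destination directory
)
print(cmd)
```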