Use the SeqsLab Run Sheet with the CLI
The Run Sheet contains all the mapping and configuration information about the data, workflows, and pipeline execution.
The SeqsLab CLI can take the Run Sheet as a parameter. This reduces the entire sequencing data processing flow, from taking the sequencer output to retrieving the analysis results, to just a few SeqsLab CLI commands. You can also eventually transform this flow into a fully automated process.
Objective
This tutorial will help you use the SeqsLab Run Sheet with the SeqsLab CLI to automate your sequencing data processing flows.
Prerequisites
Before you begin, you will need the following:
A SeqsLab Run Sheet
A running instance of the SeqsLab CLI tool. For details, see Pull and run the SeqsLab CLI.
1. Upload and register sample files
As previously explained, you can use the SeqsLab CLI to upload either individual files or entire directories to the Data Hub using the datahub upload command. Alternatively, you can use the SeqsLab Run Sheet to upload sample FASTQ files by preparing the Run Sheet file and then running the datahub upload-runsheet command. Doing so writes the upload_response.json output to stdout. The CLI uses the return code 0 to indicate that all files in the src path were uploaded successfully. A non-zero return code means that some of the files failed to upload due to a network issue. When this happens, run the command again to complete the upload process.
The SeqsLab platform uses the Azure Block List API whenever you run the SeqsLab CLI datahub upload command. This enables files to be programmatically broken up into blocks, uploaded in parallel, and reassembled in the cloud storage as a block blob. As such, even if the datahub upload command is executed multiple times, all successfully uploaded blocks are kept in the Azure cloud storage as cache and only the failed blocks are retransmitted, resulting in highly efficient and fault-resilient data transmission.
seqslab datahub upload-runsheet \
--run-sheet /home/run-2022-02-26.csv \
--input-dir /volume/fastq/2022-02-14/ \
--workspace seqslabwus2 > upload.json
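The advice to run the command again after a non-zero return code can be wrapped in a small retry loop. The following is a minimal Python sketch, not part of the SeqsLab CLI: the helper name run_until_success, the attempt limit, and the delay between attempts are assumptions made for illustration. The only CLI behavior it relies on is the return code described above.

```python
import subprocess
import time


def run_until_success(cmd, stdout_path, max_attempts=5, delay_seconds=10.0):
    """Run cmd repeatedly until it exits with return code 0.

    Each attempt redirects stdout to stdout_path (overwriting the file),
    mirroring the `> upload.json` redirection in the shell command.
    Returns the number of attempts used. The helper name and defaults
    are hypothetical, not part of the SeqsLab CLI.
    """
    for attempt in range(1, max_attempts + 1):
        with open(stdout_path, "w") as out:
            returncode = subprocess.run(cmd, stdout=out).returncode
        if returncode == 0:
            return attempt
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # brief pause before the next attempt
    raise RuntimeError(f"command still failing after {max_attempts} attempts")


# The upload command from this tutorial, expressed as an argument list.
UPLOAD_CMD = [
    "seqslab", "datahub", "upload-runsheet",
    "--run-sheet", "/home/run-2022-02-26.csv",
    "--input-dir", "/volume/fastq/2022-02-14/",
    "--workspace", "seqslabwus2",
]
```

Calling run_until_success(UPLOAD_CMD, "upload.json") would repeat the upload until it succeeds or the attempt limit is reached. Because successfully uploaded blocks are kept as cache, each retry should only need to send the blocks that failed.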
Running the datahub upload-runsheet command provides an upload_response.json object for each uploaded sample file, as shown below. Apart from automatically populating the storage-related fields, the command also fills out the metadata fields based on the Sample Sheet information extracted from the Run Sheet.
{
"name": "NA12878-R1_001_R1.fastq.gz",
"mime_type": "application/gzip",
"file_type": "fastq.gz",
"size": 136614814,
"created_time": "2022-03-03T06:08:23.405391",
"access_methods": [
{
"type": "https",
"access_url": {
"url": "https://seqslabapi32b21storage.blob.core.windows.net/seqslab/drs/usr_gNGAlr1m0EYMbEx/seqslab/seqslab/mntcbdh/TestSample/2022_01_18_2/FASTQ/22010262-3M_S39_R1_001.fastq.gz",
"headers": {
"Authorization": null
}
},
"access_tier": "hot",
"region": "westus2"
}
],
"checksums": [
{
"checksum": "73c643e2d4d473ab339af2360599086b23890a249e3bbc38ca8344606ba109d9",
"type": "sha256"
}
],
"status": "complete",
"description": null,
"metadata": {
"dates": [
{
"date": "20230803",
"type": {
"value": "sequencing"
}
}
],
"types": [
{
"method": {
"value": "NextSeq FASTQ Only"
},
"platform": {
"value": "Illumina"
}
}
],
"privacy": "",
"licenses": [],
"contributors": [],
"extra_properties": [
{
"values": "NA12878",
"category": "Sample_ID"
},
{
"values": "WGSPanCancerTest",
"category": "Description"
},
{
"values": "{$.extra_properties[?category=Date][values]}-{$.extra_properties[?category=Description][values]}-{$.extra_properties[?category=Sample_ID][values]}-{$.extra_properties[?category=Pair][values]}",
"category": "DRS_ID"
},
{
"values": "2023-08-03_WGS_01",
"category": "Run_Name"
},
{
"values": "WGSPanCancerTest/inputRead/1",
"category": "Read1_Label"
},
{
"values": "WGSPanCancerTest/inputRead/2",
"category": "Read2_Label"
},
{
"values": "https://api.seqslab.net/trs/v2/tools/WGSPanCancerTest/versions/0.1.0/WDL/files/",
"category": "Workflow_URL"
},
{
"values": "phenopacket_9k55CH8eowf6PLA",
"category": "phenopacketID"
},
{
"values": "biosample_34sjeicjekqk3ji4",
"category": "BiosampleID"
},
{
"values": "https://api.seqslab.net/trs/v2/tools/WGSPanCancerTest/versions/0.1.0/",
"category": "DiseaseID"
},
{
"values": "1",
"category": "Order_Overall"
},
{
"values": "1",
"category": "Pair"
},
{
"values": "5",
"category": "IEMFileVersion"
},
{
"values": "2023/08/03",
"category": "Date"
}
],
"primary_publication": [],
"alternate_identifiers": []
},
"tags": [
"2022-03-03_WGS_NA12878/wgs/inputRead"
],
"aliases": [],
"id": "2022_01_18_2_PGS_22010262-3M_1"
}
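Before registering, it can be useful to check the saved response for each file. The following Python sketch pulls a few fields out of an object shaped like the example above. The function names are hypothetical, and the sketch assumes that the saved file contains either a single JSON object or a JSON array of such objects, because the exact layout of the file is not specified here.

```python
import json


def load_upload_objects(text):
    """Parse the saved upload output into a list of response objects.

    Assumption: the text holds either one JSON object or a JSON array
    of objects; the exact layout is not specified in this tutorial.
    """
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def extra_property(obj, category):
    """Return the value stored in metadata.extra_properties for a category."""
    for prop in obj.get("metadata", {}).get("extra_properties", []):
        if prop.get("category") == category:
            return prop.get("values")
    return None


def summarize(obj):
    """Collect the fields most useful for a quick post-upload check."""
    sha256 = [
        c["checksum"] for c in obj.get("checksums", [])
        if c.get("type") == "sha256"
    ]
    return {
        "name": obj.get("name"),
        "status": obj.get("status"),
        "size": obj.get("size"),
        "sha256": sha256[0] if sha256 else None,
        "sample_id": extra_property(obj, "Sample_ID"),
    }
```

For example, `[summarize(o) for o in load_upload_objects(open("upload.json").read())]` would give one short record per uploaded file, which makes any entry whose status is not "complete" easy to spot.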
After the sample files are uploaded, you can then use the datahub register command to complete the DRS registration process.
seqslab datahub register \
file-blob \
--workspace seqslabwus2 \
--stdin < upload.json > register.json
2. Execute a job with WES
After the sample files are uploaded and registered as DRS objects, you can proceed to execute a job using the Workflow Execution Service (WES). The SeqsLab CLI provides a jobs request-runsheet command to create a run-request.json for each WES run defined in a Run Sheet. If a Run Sheet defines multiple WES runs, the jobs request-runsheet command generates multiple run-request.json files. When the run-request.json files are ready, you can use the jobs run command to launch all WES runs based on the run-request.json files in the working directory. The jobs run command then responds with a JSON file indicating the submitted run_id and run_name.
The following is an example:
mkdir /home/run-2022-02-26/
seqslab jobs request-runsheet \
--working-dir /home/run-2022-02-26/ \
--run-sheet /home/run-2022-02-26.csv
seqslab jobs run \
--workspace seqslabwus2 \
--working-dir /home/run-2022-02-26/ \
--response-path result.json
3. Monitor a job run
You can use the jobs run-state command to check the status of each run until it reaches the COMPLETE state.
seqslab jobs run-state --run-id run_DdtSfRfOr2AVTSe
{"run_id": "run_DdtSfRfOr2AVTSe", "state": "COMPLETE"}
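Checking the status repeatedly can be scripted as a polling loop. The following Python sketch is one possible approach, not a SeqsLab feature: the function names, polling interval, and poll limit are assumptions, and it assumes the run-state command prints only the JSON object shown above to stdout. COMPLETE is the only state named in this tutorial, so any other outcome is handled by the timeout.

```python
import json
import subprocess
import time


def cli_run_state(run_id):
    """Call `seqslab jobs run-state` and return the "state" field of its output."""
    result = subprocess.run(
        ["seqslab", "jobs", "run-state", "--run-id", run_id],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)["state"]


def wait_for_state(run_id, fetch_state=cli_run_state, target="COMPLETE",
                   interval_seconds=60.0, max_polls=120):
    """Poll fetch_state(run_id) until it returns target.

    Returns the number of polls used. Raises TimeoutError when the target
    is not reached within max_polls, which also covers failure states
    that this tutorial does not name.
    """
    for poll in range(1, max_polls + 1):
        if fetch_state(run_id) == target:
            return poll
        if poll < max_polls:
            time.sleep(interval_seconds)  # wait before asking again
    raise TimeoutError(f"{run_id} did not reach {target} after {max_polls} polls")
```

Passing the state lookup in as fetch_state keeps the loop independent of the CLI, so it can be exercised with a stand-in function before being pointed at a real run ID.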
If you want to get the full run information, you can use the jobs get command. Running this command returns the detailed WES run information in JSON format. The response includes basic attributes like the run_id, run_name, state, start_time, and end_time for run monitoring. It also includes a logs section containing a list of detailed execution information for each WDL task, such as the rendered command, start_time, end_time, exit_code, storage_url, and outputs. Lastly, it includes an outputs section that lists, for each output at the WDL main-workflow level, the mapping between its fully qualified name (FQN), DRS self-URI, and local file name.
seqslab jobs get --run-id run_DdtSfRfOr2AVTSe
{
"id": "run_DdtSfRfOr2AVTSe",
"name": "2022_02_11_WGS_22010402",
"outputs": [
{
"fqn": "WGS.sampleMutect2Vcf",
"cloud": [
"drs://api.seqslab.net/drs_FxjCfOIBJ8mm89L"
],
"local": [
"22010402_Mutect2_tumor.vcf.gz"
]
},
...
],
"logs": [
{
"id": 1558,
"name": "bwa-x-4643c-run-ddtsfrfor2avtse",
"cmd": "set -e -o pipefail\n\n/home/tools/bwa-0.7.17/bwa \\\n mem -M -t 14 ${refFa} \\\n -R \"@RG\\tID:NextSeq550_${day}\\tSM:${sampleName}\\tPL:NextSeq\\tPI:550\" \\\n ${inFileFastq} > \\\n ${outPathSam} 2>> ${outPathLog}\n\n/home/tools/samtools-1.9/samtools \\\n view -bS \\\n ${outPathSam} \\\n -o tmp.bam\n\n/home/tools/samtools-1.9/samtools \\\n sort tmp.bam \\\n -o ${outPathBam}",
"start_time": "2022-02-11T10:41:43Z",
"end_time": "2022-02-11T11:10:47Z",
"stdout": "stdout",
"stderr": "stderr",
"activity": "../audit.log",
"storage_url": "abfss://seqslab@seqslabapi32b21storage.dfs.core.windows.net/outputs/wes/run_DdtSfRfOr2AVTSe/WGS.NIPT.Bwa_x/",
"exit_code": 0,
"outputs": [
{
"fqn": "WGS.NIPT.Bwa.outFileBam",
"cloud": [
"drs://api.seqslab.net/drs_DjKkaETD7x7gZBA"
],
"local": [
"22010402.bam"
]
},
{
"fqn": "WGS.NIPT.Bwa.outFileLog",
"cloud": [
"drs://api.seqslab.net/drs_9Er0LWEMDbbCokV"
],
"local": [
"22010402_Bwa.log"
]
}
]
},
...
],
"state": "COMPLETE",
"request": {
"id": 283,
"name": "2022_01_18_2_WGS_22010402",
"description": null,
"workflow_type": "WDL",
"workflow_type_version": "1.0",
"workflow_params": { ...
},
"workflow_backend_params": { ...
},
"workflow_url": "https://api.seqslab.net/trs/v2/tools/trs_wgs_snp_indel/versions/1.0/WDL/files/",
"tags": []
},
"start_time": "2022-02-11T10:41:28Z",
"end_time": "2022-02-11T13:47:20Z"
}
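The run information lends itself to simple post-processing. The following Python sketch works on a dictionary with the same keys as the example above (outputs, logs, fqn, cloud, start_time, end_time, exit_code). The function names are hypothetical, and the sketch assumes the real command output is complete JSON, unlike the abbreviated example, which contains "..." placeholders.

```python
from datetime import datetime


def output_uris(run_info):
    """Map each workflow-level output FQN to its list of DRS self-URIs."""
    return {item["fqn"]: item["cloud"] for item in run_info.get("outputs", [])}


def task_durations(run_info):
    """Return {task name: elapsed seconds} computed from each logs entry."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # timestamp layout used in the example response
    durations = {}
    for task in run_info.get("logs", []):
        start = datetime.strptime(task["start_time"], fmt)
        end = datetime.strptime(task["end_time"], fmt)
        durations[task["name"]] = (end - start).total_seconds()
    return durations


def failed_tasks(run_info):
    """List the names of tasks whose exit_code is not 0."""
    return [
        task["name"] for task in run_info.get("logs", [])
        if task.get("exit_code") != 0
    ]
```

With the values from the example, task_durations would report 1744 seconds for the bwa task (10:41:43 to 11:10:47), and output_uris would map WGS.sampleMutect2Vcf to its DRS self-URI.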
4. Retrieve results
Once the run reaches the COMPLETE state, you can retrieve the run result using the datahub download command. Doing so downloads the pipeline run output files to the local machine. This command can take either multiple DRS self-URIs or multiple DRS IDs, and then downloads them into a destination directory.
seqslab datahub download \
--workspace seqslabwus2 \
--dst ~/Downloads/ \
--self-uri drs://api.seqslab.net/drs_ODlEMzEKxhxwc43 drs://api.seqslab.net/drs_Otr1u9pIYAe2JLr
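When the list of DRS self-URIs is produced by a script rather than typed by hand, the same command can be assembled programmatically. The following Python sketch builds the argument list using only the options shown above (--workspace, --dst, --self-uri). The function name is hypothetical, and the sketch expands "~" itself because no shell is involved when the list is passed to subprocess.

```python
import os


def build_download_command(workspace, dst, self_uris):
    """Build the argument list for `seqslab datahub download`.

    Uses only the options shown in this tutorial. Multiple self-URIs
    follow a single --self-uri option, as in the shell example.
    """
    if not self_uris:
        raise ValueError("at least one DRS self-URI is required")
    return [
        "seqslab", "datahub", "download",
        "--workspace", workspace,
        "--dst", os.path.expanduser(dst),  # expand "~" without a shell
        "--self-uri", *self_uris,
    ]
```

The resulting list could then be run with `subprocess.run(cmd, check=True)`, which raises an error if the download command exits with a non-zero return code.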