Upload and register files using the SeqsLab CLI

Objective

This tutorial will help you upload and register files to the SeqsLab platform using the SeqsLab CLI tool.

Prerequisites

Before you begin, you will need the following:

Background

In most bioinformatics workflows, static reference files such as the GRCh38 genome assembly, BWA indices, and dbSNP databases are necessary. Importing such files is therefore a critical first step in running existing workflows on the SeqsLab platform.

The SeqsLab implementation of the GA4GH Data Repository Service (DRS) API simplifies this process by providing a generic interface for data repositories that works with the SeqsLab CLI.

1. Upload files to the Data Hub

Static reference files are usually managed in a Linux filesystem that uses the following directory structure:

atgenomix@genomics:/mnt/references$ tree
.
├── hg38
│   ├── Axiom_Exome_Plus.genotypes.all_populations.poly.vcf.gz
│   ├── Axiom_Exome_Plus.genotypes.all_populations.poly.vcf.gz.tbi
│   ├── DbSNP.vcf.gz
│   ├── DbSNP.vcf.gz.tbi
│   ├── Homo_sapiens_assembly38.fasta
│   ├── Homo_sapiens_assembly38.fasta.fai
│   ├── Homo_sapiens_assembly38.dict
│   ├── Homo_sapiens_known_indels.vcf.gz
│   ├── Homo_sapiens_known_indels.vcf.gz.tbi
│   ├── Homo_sapiens_known_snps.vcf.gz
│   ├── Homo_sapiens_known_snps.vcf.gz.tbi
│   ├── Mills_and_1000G_gold_standard.indels.vcf.gz
│   ├── Mills_and_1000G_gold_standard.indels.vcf.gz.tbi
│   ├── omni.vcf.gz
│   ├── omni.vcf.gz.tbi
...
├── hg19
...

The SeqsLab CLI makes it easy to upload an entire directory using the datahub upload command. For example, you can upload the hg38 directory to your cloud storage using the following command:

seqslab datahub upload \
    --workspace seqslabwus2 \
    --src /mnt/data/hg38/ \
    --dst myDir/ \
    -r \
    > upload_response.json

The datahub upload command writes a JSON response to stdout, which the example above redirects to upload_response.json. A return code of 0 indicates that all files in the src path were uploaded successfully. A non-zero return code means that some of the files failed to upload, typically because of a network issue. When this happens, run the command again to complete the upload process.
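
Because a non-zero return code signals an incomplete upload, the re-run can be automated. The following is a minimal sketch and not part of the SeqsLab CLI: the helper name run_until_success, the retry count, and the delay are assumptions chosen for illustration. It reruns a command until it exits with 0 and saves the most recent stdout to a file, mirroring the shell redirection used above.

```python
import subprocess
import time


def run_until_success(cmd, out_path, max_attempts=3, delay_s=5.0):
    """Rerun cmd until it exits with 0; save the latest stdout to out_path.

    Hypothetical helper: max_attempts and delay_s are illustrative defaults.
    Returns the return code of the last attempt.
    """
    returncode = 1
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, stdout=subprocess.PIPE)
        returncode = result.returncode
        # Keep the most recent response, like "> upload_response.json".
        with open(out_path, "wb") as fh:
            fh.write(result.stdout)
        if returncode == 0:
            break
        if attempt < max_attempts:
            time.sleep(delay_s)
    return returncode


# The upload command from this tutorial, expressed as an argument list.
UPLOAD_CMD = [
    "seqslab", "datahub", "upload",
    "--workspace", "seqslabwus2",
    "--src", "/mnt/data/hg38/",
    "--dst", "myDir/",
    "-r",
]

# Example call (requires the SeqsLab CLI to be installed and signed in):
# run_until_success(UPLOAD_CMD, "upload_response.json")
```

A fixed number of attempts keeps the loop from running forever when the failure is not transient, for example when the credentials have expired.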

Whenever you run the datahub upload command, the files are programmatically broken up into blocks, uploaded in parallel, and re-assembled in the cloud storage as a block blob. As a result, even if the command is executed multiple times, all successfully uploaded blocks are kept in the cloud storage as a cache and only the failed blocks are re-transmitted, which makes the data transmission efficient and fault-resilient.
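
The effect of the block cache can be illustrated with a toy model. The sketch below is not the SeqsLab CLI implementation: the block size, the send callback, and the in-memory cache dictionary are assumptions that stand in for the real network transfer and the cloud storage, and the blocks are sent one after another rather than in parallel to keep the example short. It shows why a second run only needs to transmit the blocks that failed the first time.

```python
def split_blocks(data: bytes, block_size: int):
    """Split data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def upload(data: bytes, cache: dict, send, block_size: int = 4):
    """Send every block that is not cached yet; return the indices attempted.

    cache maps block index -> block bytes and plays the role of the blocks
    already staged in cloud storage. send(index, block) returns True on success.
    """
    attempted = []
    for index, block in enumerate(split_blocks(data, block_size)):
        if index in cache:
            continue  # Already uploaded in an earlier run.
        attempted.append(index)
        if send(index, block):
            cache[index] = block
    return attempted


def assemble(cache: dict, block_count: int):
    """Join the blocks in order, or return None while any block is missing."""
    if any(i not in cache for i in range(block_count)):
        return None
    return b"".join(cache[i] for i in range(block_count))


payload = b"ACGTACGTACGTAC"  # 14 bytes -> 4 blocks with block_size=4
cache = {}
# First run: block 2 fails, simulating a network issue.
first = upload(payload, cache, send=lambda i, b: i != 2)
# Second run: only the missing block is transmitted again.
second = upload(payload, cache, send=lambda i, b: True)
print(first, second, assemble(cache, 4) == payload)
# prints: [0, 1, 2, 3] [2] True
```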

Below is an example upload_response.json file which includes a JSON object for each of the uploaded files.

[
    {
        "name": "Homo_sapiens_assembly38.fasta",
        "mime_type": "application/octet-stream",
        "file_type": "dict",
        "size": 2824,
        "created_time": "2022-03-09T07:14:33.308513",
        "access_methods": [
            {
                "type": "https",
                "access_url": {
                    "url": "https://seqslabwus2.blob.core.windows.net/seqslab/drs/usr_gNGAlr1m0EYMbEx/myDir/hg38/Homo_sapiens_assembly38.fasta",
                    "headers": {
                        "Authorization": null
                    }
                },
                "access_tier": "hot",
                "region": "westus2"
            }
        ],
        "checksums": [
            {
                "checksum": "d5b5cd0882ff843eee951158fb160e806f6820a4e9d060835be10c7039cbc131",
                "type": "sha256"
            }
        ],
        "status": "complete",
        "exceptions": null,
        "description": null,
        "metadata": {},
        "tags": [],
        "aliases": [],
        "id": null
    },
    {
        "name": "Homo_sapiens_assembly38.fasta.fai",
        "mime_type": "application/octet-stream",
        "file_type": "fai",
        "size": 788,
        "created_time": "2022-03-09T07:54:49.459589",
        "access_methods": [
            {
                "type": "https",
                "access_url": {
                    "url": "https://seqslabwus2.blob.core.windows.net/seqslab/drs/usr_gNGAlr1m0EYMbEx/myDir/hg38/Homo_sapiens_assembly38.fasta.fai",
                    "headers": {
                        "Authorization": null
                    }
                },
                "access_tier": "hot",
                "region": "westus2"
            }
        ],
        "checksums": [
            {
                "checksum": "94d6b558e7a85aca6068fac4fcad3801107daa7e9ba879df16b478169ad64213",
                "type": "sha256"
            }
        ],
        "status": "complete",
        "exceptions": null,
        "description": null,
        "metadata": {},
        "tags": [],
        "aliases": [],
        "id": null
    },
    ...
]
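
Before registering, it can be useful to confirm that every entry in the response reports a complete upload. A minimal sketch follows. It relies only on the name, status, and checksums fields shown in the example above, and all helper names are hypothetical. Comparing the reported value against a locally computed SHA-256 digest assumes that the checksum is calculated over the file content, which this tutorial does not state explicitly.

```python
import hashlib
import json


def incomplete_entries(entries):
    """Return the names of entries whose status is not "complete"."""
    return [e.get("name") for e in entries if e.get("status") != "complete"]


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a local file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reported_sha256(entry):
    """Return the sha256 checksum listed for an entry, or None if absent."""
    for item in entry.get("checksums") or []:
        if item.get("type") == "sha256":
            return item.get("checksum")
    return None


def load_response(path="upload_response.json"):
    """Load the JSON array written by the upload command."""
    with open(path) as fh:
        return json.load(fh)
```

For example, incomplete_entries(load_response()) returns an empty list when every object in the file has the status complete.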

2. Complete the DRS registration

After the reference files are uploaded, you can then use the SeqsLab CLI datahub register-blob command to complete the DRS registration. The SeqsLab DRS API supports both file-blob and dir-blob, where either files or folders are registered as individual DRS objects. For most cases, reference files should be registered as a file-blob, so that each file is registered as a single DRS object and can be accessed individually.

# file-blob registration
seqslab datahub register-blob \
    file \
    --workspace seqslabwus2 \
    --stdin < upload_response.json > register_file.json

One example that would benefit from using the dir option is the annotation database Variant Effect Predictor (VEP), which includes 10,700 files that are rarely used individually. In this case, it makes sense to register the directory as a single DRS object.

# VEP folder upload
seqslab datahub upload \
    --workspace seqslabwus2 \
    --src /mnt/data/hg38/VEPCache/ \
    --dst hg38/ \
    -r \
    > upload_folder.json

# dir-blob registration
seqslab datahub register-blob \
    dir \
    --workspace seqslabwus2 \
    --stdin < upload_folder.json > register_folder.json

3. Customize the DRS metadata

When you run the datahub register-blob command, you need upload_response.json as the input file. You can either retain the default values in the file or modify some of the customizable attributes to make your data more searchable and easier to find.

The customizable attributes include tags, description, metadata, aliases, and id. However, note that the ID of a DRS object must be unique and should comply with the regular expression r"^[0-9a-zA-Z\-\_]+$".
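
A candidate ID can be checked against this pattern before registration. The following is a minimal sketch with hypothetical helper names. It uses the regular expression quoted above and only detects duplicates within a single response file, so it cannot tell whether an ID is already taken on the platform.

```python
import re

# Pattern quoted in this tutorial for valid DRS IDs.
DRS_ID_PATTERN = re.compile(r"^[0-9a-zA-Z\-\_]+$")


def is_valid_drs_id(candidate):
    """Return True if candidate is a non-empty string matching the pattern."""
    # fullmatch also rejects a trailing newline, which "$" alone would allow.
    return isinstance(candidate, str) and DRS_ID_PATTERN.fullmatch(candidate) is not None


def duplicate_ids(entries):
    """Return the sorted IDs that appear more than once among the entries."""
    seen, dupes = set(), set()
    for entry in entries:
        drs_id = entry.get("id")
        if drs_id is None:
            continue  # Entries without a custom ID are skipped.
        if drs_id in seen:
            dupes.add(drs_id)
        seen.add(drs_id)
    return sorted(dupes)


print(is_valid_drs_id("hg38_Homo_sapiens_assembly38-fasta"))  # True
print(is_valid_drs_id("hg38/Homo_sapiens_assembly38.fasta"))  # False
```

The second call fails because the slash and the period are outside the allowed character set, which is why path-like names need to be rewritten before they are used as IDs.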

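Using the previous upload_response.json file, you can modify the customizable attributes as shown in the following example. The lines that begin with # only mark the customized attributes. JSON does not support comments, so omit these lines from the actual file.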

[
    {
        "name": "Homo_sapiens_assembly38.fasta",
        "mime_type": "application/octet-stream",
        "file_type": "dict",
        "size": 2824,
        "created_time": "2022-03-09T07:14:33.308513",
        "access_methods": [
            {
                "type": "https",
                "access_url": {
                    "url": "https://seqslabwus2.blob.core.windows.net/seqslab/drs/usr_gNGAlr1m0EYMbEx/myDir/hg38/Homo_sapiens_assembly38.fasta",
                    "headers": {
                        "Authorization": null
                    }
                },
                "access_tier": "hot",
                "region": "westus2"
            }
        ],
        "checksums": [
            {
                "checksum": "d5b5cd0882ff843eee951158fb160e806f6820a4e9d060835be10c7039cbc131",
                "type": "sha256"
            }
        ],
        "status": "complete",
        "exceptions": null,
        ### customization information
        ############################################################
        "description": "Genome Reference Consortium Human Build 38",
        "metadata": {"Date": "2019/02/28", "Synonyms": "hg38", "Assembly type": "haploid-with-alt-loci"},
        "tags": ["hg38/Homo_sapiens_assembly38.fasta"],
        "aliases": ["grch38.fa", "hg38.fa"],
        "id": "hg38_Homo_sapiens_assembly38-fasta"
        ############################################################
    },
    ...
]

The tags and ID DRS attributes are particularly important when retrieving DRS objects. The following is an example of a Python script (drs_id_tag_customization.py) that can be used to add customized DRS IDs and DRS tags based on the access_url attribute of each JSON object, and to output the updated upload_response.json file via stdout.

import argparse
import json
import sys


def main(args):
    # Read the upload response (a JSON array of objects) from stdin.
    data = json.load(sys.stdin)
    for obj in data:
        try:
            url = obj['access_methods'][0]['access_url']['url']
            # Keep the path segments from index 6 onward, that is, everything
            # after the user directory in the example URLs of this tutorial.
            parts = url.split('/')[6:]
        except (KeyError, IndexError, TypeError, AttributeError):
            # Report on stderr so that stdout stays valid JSON for the next command.
            print("object does not have access_methods.access_url.url", file=sys.stderr)
            continue
        if args.tag:
            obj['tags'].append('/'.join(parts))
        if args.id:
            # Periods are not allowed in DRS IDs, so replace them with hyphens.
            obj['id'] = '_'.join(part.replace('.', '-') for part in parts)
    # Write the updated response to stdout.
    print(json.dumps(data, indent=4))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--tag", help='add customized tag based on absolute path', action='store_true')
    parser.add_argument("--id", help='add customized id based on absolute path', action='store_true')
    main(parser.parse_args())
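
To make the effect of the script concrete, the following sketch applies the same slicing to the example URL from step 1. It only illustrates the string handling, and the function name derive_tag_and_id is hypothetical. The index 6 assumes the URL layout shown in this tutorial, so the destination prefix (myDir) becomes part of both the tag and the ID.

```python
def derive_tag_and_id(url, start=6):
    """Return (tag, id) built from the URL segments at index start onward."""
    parts = url.split('/')[start:]
    tag = '/'.join(parts)
    # Periods are replaced so that the ID matches the allowed character set.
    drs_id = '_'.join(part.replace('.', '-') for part in parts)
    return tag, drs_id


url = ("https://seqslabwus2.blob.core.windows.net/seqslab/drs/"
       "usr_gNGAlr1m0EYMbEx/myDir/hg38/Homo_sapiens_assembly38.fasta")
tag, drs_id = derive_tag_and_id(url)
print(tag)     # myDir/hg38/Homo_sapiens_assembly38.fasta
print(drs_id)  # myDir_hg38_Homo_sapiens_assembly38-fasta
```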

By chaining drs_id_tag_customization.py with the datahub register-blob command, you can customize the DRS IDs and DRS tags in one step, which can help minimize the effort of preparing the SeqsLab execs.json file.

cat upload_response.json | python3 drs_id_tag_customization.py --tag --id |
    seqslab datahub register-blob \
        file \
        --workspace seqslabwus2 \
        --stdin \
        > register_file.json