(cli:tutorial-drs)=
# Upload and register files using the SeqsLab CLI

## Objective
This tutorial will help you upload and register files to the SeqsLab platform using the SeqsLab CLI tool.

## Prerequisites

Before you begin, you will need the following:

- A SeqsLab managed application on Azure. For details, see [](testdrive:overview) or [](csp:azure-deployment).
- A running instance of the SeqsLab CLI tool. For details, see [](cli:tutorial-getting-started).
- A command line interface (CLI) tool such as the Windows Command Prompt or the Mac Terminal.


## Background

Most bioinformatics workflows require static reference files such as the [GRCh38](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) (![external link](../images/external-link.png)) genome assembly, [BWA](https://github.com/lh3/bwa) (![external link](../images/external-link.png)) indices,
and [dbSNP](https://www.ncbi.nlm.nih.gov/snp/) (![external link](../images/external-link.png)) databases. Importing these files is therefore a critical first step when using existing workflows on the SeqsLab platform.

The [SeqsLab implementation](api:drs) of the [GA4GH Data Repository Service (DRS) API](https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.2.0/docs/) (![external link](../images/external-link.png)) helps simplify this process by providing a generic interface for data repositories that works with the SeqsLab CLI.

## 1. Upload files to the Data Hub
Static reference files are usually managed in a Linux filesystem that uses the following directory structure:

```
atgenomix@genomics:/mnt/data$ tree
.
├── hg38
│   ├── Axiom_Exome_Plus.genotypes.all_populations.poly.vcf.gz
│   ├── Axiom_Exome_Plus.genotypes.all_populations.poly.vcf.gz.tbi
│   ├── DbSNP.vcf.gz
│   ├── DbSNP.vcf.gz.tbi
│   ├── Homo_sapiens_assembly38.fasta
│   ├── Homo_sapiens_assembly38.fasta.fai
│   ├── Homo_sapiens_assembly38.dict
│   ├── Homo_sapiens_known_indels.vcf.gz
│   ├── Homo_sapiens_known_indels.vcf.gz.tbi
│   ├── Homo_sapiens_known_snps.vcf.gz
│   ├── Homo_sapiens_known_snps.vcf.gz.tbi
│   ├── Mills_and_1000G_gold_standard.indels.vcf.gz
│   ├── Mills_and_1000G_gold_standard.indels.vcf.gz.tbi
│   ├── omni.vcf.gz
│   ├── omni.vcf.gz.tbi
...
├── hg19
...
```

The SeqsLab CLI makes it easy to upload an entire directory using the ***datahub upload*** command. For example, you can upload the hg38 directory to your cloud storage using the following command:

```
seqslab datahub upload \
    --workspace seqslabwus2 \
    --src /mnt/data/hg38/ \
    --dst myDir/ \
    -r \
    > upload_response.json
```

The SeqsLab CLI ***datahub upload*** command writes a JSON response to `stdout`, which the example above redirects to `upload_response.json`. A return code of `0` indicates that all files in the **src** path were uploaded successfully. A non-zero return code means that some of the files failed to upload, typically because of a network issue. When this happens, run the command again to complete the upload process.
 
Whenever you run the SeqsLab CLI ***datahub upload*** command, each file is programmatically broken up into blocks, uploaded in parallel, and reassembled in the cloud
storage as a [block blob](https://docs.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs) (![external link](../images/external-link.png)). Because all successfully uploaded blocks are kept in the cloud storage as a cache, running the ***datahub upload*** command again retransmits only the failed blocks, resulting in highly efficient and fault-resilient data transmission.
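
Because running the command again is the remedy for a non-zero return code, the retry can be scripted. The following is a minimal Python sketch and is not part of the SeqsLab CLI: the helper name `run_until_success`, the attempt limit, and the delay are illustrative assumptions, while the arguments in `UPLOAD_CMD` are copied from the upload example above.

```python
import subprocess
import sys
import time

# Arguments copied from the upload example in this tutorial.
UPLOAD_CMD = [
    "seqslab", "datahub", "upload",
    "--workspace", "seqslabwus2",
    "--src", "/mnt/data/hg38/",
    "--dst", "myDir/",
    "-r",
]


def run_until_success(cmd, stdout_path, max_attempts=3, delay_seconds=5.0):
    """Run cmd until it exits with return code 0.

    The command's stdout is written to stdout_path. Returns the number of
    attempts used and raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        # Overwrite the file on each attempt so that it only holds the
        # response of the most recent run.
        with open(stdout_path, "w") as out:
            result = subprocess.run(cmd, stdout=out)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} exited with return code {result.returncode}",
              file=sys.stderr)
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"command still failing after {max_attempts} attempts")
```

Calling `run_until_success(UPLOAD_CMD, "upload_response.json")` would then repeat the upload up to three times before giving up.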

Below is an example `upload_response.json` file which includes a JSON object for each of the uploaded files.

```
[
    {
        "name": "Homo_sapiens_assembly38.fasta",
        "mime_type": "application/octet-stream",
        "file_type": "dict",
        "size": 2824,
        "created_time": "2022-03-09T07:14:33.308513",
        "access_methods": [
            {
                "type": "https",
                "access_url": {
                    "url": "https://seqslabwus2.blob.core.windows.net/seqslab/drs/usr_gNGAlr1m0EYMbEx/myDir/hg38/Homo_sapiens_assembly38.fasta",
                    "headers": {
                        "Authorization": null
                    }
                },
                "access_tier": "hot",
                "region": "westus2"
            }
        ],
        "checksums": [
            {
                "checksum": "d5b5cd0882ff843eee951158fb160e806f6820a4e9d060835be10c7039cbc131",
                "type": "sha256"
            }
        ],
        "status": "complete",
        "exceptions": null,
        "description": null,
        "metadata": {},
        "tags": [],
        "aliases": [],
        "id": null
    },
    {
        "name": "Homo_sapiens_assembly38.fasta.fai",
        "mime_type": "application/octet-stream",
        "file_type": "fai",
        "size": 788,
        "created_time": "2022-03-09T07:54:49.459589",
        "access_methods": [
            {
                "type": "https",
                "access_url": {
                    "url": "https://seqslabwus2.blob.core.windows.net/seqslab/drs/usr_gNGAlr1m0EYMbEx/myDir/hg38/Homo_sapiens_assembly38.fasta.fai",
                    "headers": {
                        "Authorization": null
                    }
                },
                "access_tier": "hot",
                "region": "westus2"
            }
        ],
        "checksums": [
            {
                "checksum": "94d6b558e7a85aca6068fac4fcad3801107daa7e9ba879df16b478169ad64213",
                "type": "sha256"
            }
        ],
        "status": "complete",
        "exceptions": null,
        "description": null,
        "metadata": {},
        "tags": [],
        "aliases": [],
        "id": null
    },
    ...
]
```
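
Before registering, it can be useful to confirm that every object in the response reports `"status": "complete"`. The sketch below is a hedged illustration: the function names are hypothetical, and it assumes that a file that did not upload fully carries a `status` other than `complete` and may describe the problem in `exceptions`.

```python
import json


def load_entries(path="upload_response.json"):
    """Load the JSON array written by the upload command."""
    with open(path) as handle:
        return json.load(handle)


def find_incomplete(entries):
    """Return the entries whose status is anything other than "complete"."""
    return [entry for entry in entries if entry.get("status") != "complete"]


def describe(entries):
    """Build one human-readable line per incomplete entry."""
    return [
        f'{entry.get("name")}: status={entry.get("status")}, '
        f'exceptions={entry.get("exceptions")}'
        for entry in find_incomplete(entries)
    ]
```

For example, `describe(load_entries())` returns an empty list when every file in `upload_response.json` is marked as complete.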

## 2. Complete the DRS registration
After the reference files are uploaded, you can use the SeqsLab CLI ***datahub register-blob*** command to complete the DRS registration. The SeqsLab DRS API supports both **file-blob** and **dir-blob** registration, where either files or folders are registered as individual DRS objects. In most cases, reference files should be registered as a **file-blob**, so that each file is registered as a single DRS object and can be accessed individually.

```
# file-blob registration
seqslab datahub register-blob \
    file \
    --workspace seqslabwus2 \
    --stdin < upload_response.json > register_file.json
```

One example that benefits from the **dir** option is the annotation database [Variant Effect Predictor](https://m.ensembl.org/info/docs/tools/vep/script/vep_cache.html#cache) (![external link](../images/external-link.png)), which includes
10,700 files that are rarely used individually. In this case, it makes sense to register the directory as a single DRS object.

```
# VEP folder upload
seqslab datahub upload \
    --workspace seqslabwus2 \
    --src /mnt/data/hg38/VEPCache/ \
    --dst hg38/ \
    -r \
    > upload_folder.json

# dir-blob registration
seqslab datahub register-blob \
    dir \
    --workspace seqslabwus2 \
    --stdin < upload_folder.json > register_folder.json
```

(drs:customize-metadata)=
## 3. Customize the DRS metadata

The SeqsLab CLI ***datahub register-blob*** command takes the `upload_response.json` file as input. You can either retain the default values in the file or modify some of the customizable attributes to make your data more searchable and easier to find.

The customizable attributes include *tags*, *description*, *metadata*, *aliases*, and *id*. However, note that the ID of a DRS object must be unique and should comply with the regular expression `r"^[0-9a-zA-Z\-\_]+$"`.  
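
The ID rule can be checked locally before registration. The following is a minimal sketch that uses the regular expression quoted above; the function name `check_ids` is hypothetical, and the uniqueness check only covers IDs within one response file, not objects that are already registered on the platform.

```python
import re
from collections import Counter

# Regular expression quoted in this tutorial for valid DRS IDs.
DRS_ID_PATTERN = re.compile(r"^[0-9a-zA-Z\-\_]+$")


def check_ids(entries):
    """Return (invalid_ids, duplicate_ids) for a list of upload entries.

    Entries whose "id" is missing or null are skipped, because they
    have not been given a customized ID.
    """
    ids = [entry["id"] for entry in entries if entry.get("id") is not None]
    # fullmatch rejects strings with a trailing newline, which "$" alone allows.
    invalid = [value for value in ids if not DRS_ID_PATTERN.fullmatch(value)]
    duplicates = sorted(value for value, count in Counter(ids).items() if count > 1)
    return invalid, duplicates
```

An ID such as `hg38_Homo_sapiens_assembly38-fasta` passes the check, whereas `hg38/Homo_sapiens_assembly38.fasta` is reported as invalid because `/` and `.` fall outside the allowed characters.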

Using the previous `upload_response.json` file, you can modify the customizable attributes as shown in the following example:

```
[
    {
        "name": "grch38.dict",
        "mime_type": "application/octet-stream",
        "file_type": "dict",
        "size": 2824,
        "created_time": "2022-03-09T07:14:33.308513",
        "access_methods": [
            {
                "type": "https",
                "access_url": {
                    "url": "https://seqslabwus2.blob.core.windows.net/seqslab/drs/usr_gNGAlr1m0EYMbEx/myDir/hg38/Homo_sapiens_assembly38.fasta",
                    "headers": {
                        "Authorization": null
                    }
                },
                "access_tier": "hot",
                "region": "westus2"
            }
        ],
        "checksums": [
            {
                "checksum": "d5b5cd0882ff843eee951158fb160e806f6820a4e9d060835be10c7039cbc131",
                "type": "sha256"
            }
        ],
        "status": "complete",
        "exceptions": null,
        ### customization information
        ############################################################
        "description": "Genome Reference Consortium Human Build 38",
        "metadata": {"Date": "2019/02/28", "Synonyms": "hg38", "Assembly type": "haploid-with-alt-loci"},
        "tags": ["hg38/Homo_sapiens_assembly38.fasta"],
        "aliases": ["grch38.fa", "hg38.fa"],
        "id": "hg38_Homo_sapiens_assembly38-fasta"
        ############################################################
    },
    ...
]
```

The *tags* and *id* DRS attributes are particularly important when retrieving DRS objects. The following is an example of a Python script (`drs_id_tag_customization.py`) that reads the `upload_response.json` content from `stdin`, adds customized DRS IDs and DRS tags based on the *access_url* attribute of each JSON object, and writes the updated JSON to `stdout`.

```
import argparse
import json
import sys


def main(args):
    # Read the upload response (a JSON array of objects) from stdin.
    data = json.load(sys.stdin)
    for obj in data:
        try:
            url = obj['access_methods'][0]['access_url']['url']
        except (KeyError, IndexError, TypeError):
            # Report on stderr so that stdout remains valid JSON for piping.
            print("object does not have access_methods.access_url.url", file=sys.stderr)
            continue
        # Keep the path components from index 6 onward,
        # e.g. myDir/hg38/Homo_sapiens_assembly38.fasta
        parts = url.split('/')[6:]
        if args.tag:
            obj.setdefault('tags', []).append('/'.join(parts))
        if args.id:
            # Replace dots so that the ID complies with ^[0-9a-zA-Z\-\_]+$
            obj['id'] = '_'.join(part.replace('.', '-') for part in parts)
    # Write the updated objects to stdout.
    print(json.dumps(data, indent=4))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--tag", help='add customized tag based on absolute path', action='store_true')
    parser.add_argument("--id", help='add customized id based on absolute path', action='store_true')
    main(parser.parse_args())
```
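
To make the derivation concrete, the short sketch below applies the same split-and-join logic to the access URL from the earlier example. It is an illustration only; the variable names are not part of the script.

```python
# Access URL taken from the example upload response in this tutorial.
url = ("https://seqslabwus2.blob.core.windows.net/seqslab/drs/"
       "usr_gNGAlr1m0EYMbEx/myDir/hg38/Homo_sapiens_assembly38.fasta")

# Same slicing as the script: keep the path components from index 6 onward.
parts = url.split("/")[6:]

tag = "/".join(parts)
drs_id = "_".join(part.replace(".", "-") for part in parts)

print(tag)     # myDir/hg38/Homo_sapiens_assembly38.fasta
print(drs_id)  # myDir_hg38_Homo_sapiens_assembly38-fasta
```

The tag keeps the destination path as-is, while the ID replaces `/` and `.` so that it contains only the characters allowed for DRS IDs.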

By chaining the `drs_id_tag_customization.py` and the ***datahub register-blob*** command, you can customize the DRS IDs and DRS tags, which can help minimize the effort of preparing the [SeqsLab execs.json](cli:tutorial-trs-execs) file.

```
cat upload_response.json | python3 drs_id_tag_customization.py --tag --id |
    seqslab datahub register-blob \
        file \
        --workspace seqslabwus2 \
        --stdin \
        > register_file.json
```
