Data Hub & Dataset Management

Data Hub & Dataset Management#

06/16/2025

The Data Hub is the central repository for all your genomic and biomedical data assets. Built on a Data Lakehouse architecture, it enables you to manage raw datasets, curated Delta tables, and associated metadata in a secure, compliant manner.

data hub

Navigating the Data Catalog#

The Data Hub provides a tabular view of all registered datasets. You can quickly filter and find assets using the following interface elements:

Search Bar: Use the search field at the top right of the table to filter datasets by ID, name or alias.
Label Tags: Custom tags (e.g., References, Project A, Pilot-Study) are displayed as colored badges. Clicking these allows for faceted searching across your organization’s entire data library.
Faceted Filters: Refine your view using the dropdown filters located above the dataset table:
- Storage Region: Filter assets based on their cloud storage access method (e.g., HTTPS, ABFSS westus2).
- File Types: Narrow down results by format (e.g., Delta, FASTQ, VCF, BAM). delta datasets represent high-performance, version-controlled Delta tables.
- Users: Filter datasets by the individual who uploaded or registered the asset.
- Date: Filter by the Uploaded/Registered Date to find the most recent additions or historical data.

Adding New Datasets#

To register or ingest new data into the Lakehouse, use the + Add button located at the top of the left menu. Clicking this button reveals two sub-menu options depending on your data source:

Option 1: Local file…#

Use this option to upload files directly from your computer.

Select Local file… from the sub-menu.
Drag and drop your files or use the file browser to select them.
Select Workspace: Choose the target workspace where your files will be uploaded and stored.
Click Upload to start the transfer process.
Post-Upload Registration: Once the upload to storage is complete, the system automatically registers the dataset and provides the following information:
- Name: The assigned identifier for the dataset.
- Created Time: The timestamp when the dataset was physically created in the storage layer.
- Version: The timestamp reflecting when the dataset was registered or when its information was last updated in the system.
- SHA2 Checksum: A unique cryptographic hash for data integrity verification.

Option 2: URL…#

Use this option to register data already residing in cloud storage (e.g., Azure Blob Storage), open data repositories (e.g., brain-genomics-public), or datasets accessible via HTTPS (e.g., ClinVar database).

Select URL… from the sub-menu.
Specify the fields for datasets.
Click Add URL to add more URLs if registering multiple assets simultaneously.
Once specified, click Save to register the new datasets in the catalog.

Required Fields for Data Governance#

To ensure full traceability and compliance within the Lakehouse, the following fields are mandatory:

Access URL: The validated URL where the data resides, provided for data retrival (abfss, https, s3a, etc.).
Name: A unique, descriptive identifier for the asset.
File size in bytes: The actual size of the dataset file in bytes.
Checksum (optional): Recommended to provide for data integrity validation.

Bulk Importing Datasets#

For large-scale data migration or batch registration of external datasets, SeqsLab supports bulk import via a structured manifest file.

The Bulk Import Workflow#

Click Bulk Import and download the Excel template.
Complete the required fields for all your datasets to be registered.
Upload the completed Excel file and click “IMPORT” to start the process.

Organizing with Metadata and Labels#

Properly organized data is key to reproducible workflows. SeqsLab allows for rich metadata management using a flexible key-value system.

Dataset Metadata#

Metadata in SeqsLab is stored as a list of key-value pairs.

Values: Can be simple strings or complex JSON strings, allowing you to store structured biological or clinical information directly within the dataset profile.
Management:
- Individual: Locate the dataset, Click () and then click Edit Metadata to open the details view and manage metadata entries one by one.
- Bulk: For a list of datasets, you can manage metadata by uploading a completed Excel file containing the key-value mappings for each dataset.
  - Locate and select the target datasets.
  - Click Metadata at the top menu.
  - Download the template (.xlsx) and complete the metadata for each dataset.
  - Drag your completed Excel or use file browser to upload the Excel to import the metadata.

Dataset Labels#

Locate your datasets in the list and click the select all checkbox for the list of actions at the top of the dataset table.

Click ( > Manage Labels) for your dataset or select multiple datasets to batch edit.

Add new labels (Manage labels)
Selection Behavior:
- Select from existing tags in the list.
- Auto-Selection: Previously selected tags for the dataset are checked automatically.
- Removing Labels: Unchecking a tag will automatically remove that label from the selected dataset(s).
Click Save to apply the updates.

Next Steps#

Once your data is organized in the Data Hub, you can proceed to analyze it:

Build and run bioinformatics workflows in a fully managed cluster computing environment.