SeqsLab Data Hub
The SeqsLab Data Hub is a standardized data repository system that supports cloud-based access to, and retrieval of, sample files and tools. Built on a data lakehouse architecture, the Data Hub provides a single interface for data access that spans blob storage, data lakes, data warehouses, and relational databases.
FAIR principles on SeqsLab
The SeqsLab Data Hub applies the findable, accessible, interoperable, and reusable (FAIR) principles, enabling users to connect, integrate, and manage big data workloads from a secure central repository. All datasets are findable using unique, self-contained, fully qualified names (FQNs). FQNs improve data accessibility and interoperability, and provide a standard way to access the same data repeatedly without manual intervention.
The data repository supports industry-standard file types, from genomic files such as FASTQ and BAM to structured electronic medical records (EMRs) and reference files.
The SeqsLab Data Hub also implements the GA4GH Data Repository Service (DRS) API, which lets application developers reuse data and functions to build new products and services. Because the API is an open-standard REST interface, businesses can adapt quickly to changing user needs and preferences.
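As an illustration of what a DRS client might look like, the following is a minimal sketch that uses only the Python standard library. The `/ga4gh/drs/v1/objects/{object_id}` path and the `access_methods` / `access_url` fields come from the GA4GH DRS v1 specification. The base URL, the object ID, and the use of a bearer token are assumptions made for this example and are not taken from the SeqsLab documentation.

```python
import json
from typing import List
from urllib.parse import quote
from urllib.request import Request, urlopen


def drs_object_url(base_url: str, object_id: str) -> str:
    """Build the GA4GH DRS v1 metadata URL for a single object."""
    # Percent-encode the ID so characters such as "/" stay inside one path segment.
    return f"{base_url.rstrip('/')}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"


def fetch_drs_object(base_url: str, object_id: str, token: str) -> dict:
    """GET the DrsObject JSON document.

    Assumption: the server accepts an OAuth-style bearer token.
    """
    request = Request(
        drs_object_url(base_url, object_id),
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )
    with urlopen(request) as response:
        return json.load(response)


def https_access_urls(drs_object: dict) -> List[str]:
    """Collect direct HTTPS download URLs listed in the object's access_methods."""
    urls = []
    for method in drs_object.get("access_methods", []):
        access_url = method.get("access_url")
        # Methods that expose only an access_id need a second /access/ call; skip them here.
        if method.get("type") == "https" and access_url:
            urls.append(access_url["url"])
    return urls
```

A caller would pass its own endpoint and credentials, for example `fetch_drs_object("https://drs.example.org", "drs_example_id", token)`, where both the host and the ID are placeholders.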
Delta Lake support
The Data Hub is designed to store and process large amounts of varied data at a lower infrastructure cost, and to optimize that data for analytics, SQL, and machine learning. SeqsLab supports Delta tables for data management, so the repository provides version control and is ACID-compliant.
Life cycle management
The SeqsLab Data Hub provides a life cycle management function for datasets stored on the SeqsLab data lake storage. It automatically moves datasets down through the storage tiers (hot, cool, archive, and finally deletion) based on when each file was last accessed.
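The tier-down rule described above can be sketched as a simple lookup on the number of days since last access. This is only an illustration of the idea: the thresholds (30, 180, and 365 days) and the function name `select_tier` are hypothetical and do not reflect SeqsLab's actual policy or configuration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds: minimum idle days before a dataset moves to each tier.
# Ordered from the longest idle period to the shortest so the first match wins.
TIER_THRESHOLDS = [
    (365, "delete"),
    (180, "archive"),
    (30, "cool"),
]


def select_tier(last_accessed: datetime, now: datetime) -> str:
    """Return the storage tier for a dataset based on its last access time."""
    idle_days = (now - last_accessed).days
    for min_idle_days, tier in TIER_THRESHOLDS:
        if idle_days >= min_idle_days:
            return tier
    # Recently accessed data stays in the fastest tier.
    return "hot"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for days in (5, 45, 200, 400):
        print(days, select_tier(now - timedelta(days=days), now))
```

Keeping the thresholds in one ordered table means a policy change is a data edit rather than a change to the branching logic.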
Compliance and traceability
All defined workflow jobs and generated output datasets are stored on SeqsLab. You can track and manage your workflows and datasets from the Data Hub.