Hi-C Data Processing#

24/03/2022

Hi-C data is often used to analyze genome-wide chromatin organization, such as topologically associating domains (TADs), which are linearly contiguous regions of the genome that preferentially self-associate in 3D space. Atgenomix worked with Professor Jia-Ming Chang of National Chengchi University in Taiwan to develop a Hi-C data analysis pipeline on the Seqslab Platform.

In this case study, we first standardize a Hi-C data processing pipeline with WDL and the open-source tool HiC-Pro. We then accelerate the pipeline with different operator pipeline configurations, and finally we evaluate its performance.

Main WDL#

With the aid of Zhan-Wei Lin, the Hi-C data processing pipeline was standardized in WDL, as shown in the HiC-Pro_0.0.6.wdl file. The main workflow, Hic_Docker, contains no subworkflows; all tasks are called directly. In addition, to support execution of multiple samples, a scatter operation is applied inside the main workflow, as sketched below.
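
A minimal sketch of that structure follows (assumed WDL 1.0 syntax with simplified task bodies; the actual HiC-Pro_0.0.6.wdl defines many more inputs and tasks):

version 1.0

workflow Hic_Docker {
  input {
    # One [R1, R2] FASTQ pair per sample, as in inputs.json below.
    Array[Array[File]] inputSamples
    # Per-sample settings, e.g. ["SRR400264_00", "Dixon2M"].
    Array[Array[String]] inputSettings
  }

  # Scatter over samples: each iteration calls the tasks directly,
  # with no intermediate subworkflows.
  scatter (i in range(length(inputSamples))) {
    call Bowtie_Global_Mapping {
      input:
        sampleR1 = inputSamples[i][0],
        sampleR2 = inputSamples[i][1]
    }
    # ... Bowtie_Local_Trimming, Bowtie_Local_Mapping, and the
    # downstream tasks are called in the same way ...
  }
}

task Bowtie_Global_Mapping {
  input {
    File sampleR1
    File sampleR2
  }
  command <<<
    # Placeholder; the real task runs bowtie2 end-to-end mapping.
    echo "mapping ~{sampleR1} ~{sampleR2}"
  >>>
  output {
    File log = stdout()
  }
}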

Based on the WDL, an inputs.json file is provided as follows.

{
    "Hic_Docker.sampleName": "SRR400264_00",
    "Hic_Docker.tableFile": "/datadrive/input_files/annotation/chrom_hg19.sizes",
    "Hic_Docker.dockerImage": "atgenomix.azurecr.io/atgenomix/seqslab_runtime-1.5_ubuntu-20.04_hicpro:2021-11-08-08-00",
    "Hic_Docker.hicPath": "/HiC-Pro_3.1.0",
    "Hic_Docker.referenceGenome": "hg19",
    "Hic_Docker.bowtie2LocalOptions": "--very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder",
    "Hic_Docker.bowtie2GlobalOptions": "--very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder",
    "Hic_Docker.binSize": "20000",
    "Hic_Docker.bowtieIndexPath": "/datadrive/input_files/btw2idx/hg19",
    "Hic_Docker.bedFile": "/datadrive/input_files/annotation/HindIII_resfrag_hg19.bed",
    "Hic_Docker.nCPU": "1",
    "Hic_Docker.cutsite": "AAGCTAGCTT",
    "Hic_Docker.nMem": "100",
    "Hic_Docker.inputSamples": [["/datadrive/input_files/fastq/SRR1030718_1.fastq", "/datadrive/input_files/fastq/SRR1030718_2.fastq"]],
    "Hic_Docker.inputSettings": [["SRR400264_00", "Dixon2M"]],
    "Hic_Docker.tagR1": "_R1",
    "Hic_Docker.tagR2": "_R2",
    "Hic_Docker.resName": "Dixon2M",
    "Hic_Docker.fakeFile": "test"
}
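
Each key in inputs.json corresponds to an input declaration on the Hic_Docker workflow. The sketch below shows one plausible set of declarations: the names are taken from the file above, while the types are assumptions (note that several numeric values, such as binSize, nCPU, and nMem, are passed as strings in inputs.json):

workflow Hic_Docker {
  input {
    String sampleName
    File tableFile                      # chromosome sizes table (chrom_hg19.sizes)
    String dockerImage                  # Seqslab HiC-Pro runtime image
    String hicPath                      # HiC-Pro installation path inside the image
    String referenceGenome              # "hg19"
    String bowtie2LocalOptions          # bowtie2 options for local (trimmed) mapping
    String bowtie2GlobalOptions         # bowtie2 options for global (end-to-end) mapping
    String binSize                      # contact-matrix resolution in bp
    String bowtieIndexPath              # bowtie2 index prefix
    File bedFile                        # HindIII restriction-fragment BED
    String nCPU
    String cutsite                      # ligation-site sequence
    String nMem
    Array[Array[File]] inputSamples     # [R1, R2] FASTQ pairs
    Array[Array[String]] inputSettings  # per-sample settings
    String tagR1                        # read-pair tag in R1 file names
    String tagR2                        # read-pair tag in R2 file names
    String resName
    String fakeFile
  }
  # ... scatter and task calls as sketched above ...
}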

WOM Graph#

Parsing the WDL with the Cromwell womtool yields the following WOM graph, which describes the Hi-C data pipeline, including the tasks Bowtie_Global_Mapping, Bowtie_Local_Trimming, Bowtie_Local_Mapping, Mapping_Combine, Mapping_Stat, Merge_Pairs, Mapped_Hic_Fragments, Merge_Valid_Interaction, Making_Plot, Build_Matrix, and Ice_Normalization.

Figure: WOM graph of the main WDL
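
The graph can be regenerated from the WDL with womtool's graph command, which prints a DOT description of the workflow (the womtool.jar path here is illustrative):

java -jar womtool.jar graph HiC-Pro_0.0.6.wdl > Hic_Docker.dot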

Parallelization Settings#

In this example, the gzipped FASTQ files are around 6.2 GB (R1) and 6.4 GB (R2). By applying the following chunking settings, we can separate these two FASTQ files into 53 partitions.

{
    "fqn": "Hic_Docker.Bowtie_Global_Mapping.sampleR1",
    "operators": {
        "format": {
            "class": "com.atgenomix.seqslab.piper.operators.format.FastqFormat"
        },
        "p_pipe": {
            "class": "com.atgenomix.seqslab.piper.operators.pipe.PPipe"
        },
        "source": {
            "class": "com.atgenomix.seqslab.piper.operators.source.FastqSource",
            "codec": "org.seqdoop.hadoop_bam.util.BGZFCodec"
        },
        "partition": {
            "class": "com.atgenomix.seqslab.piper.operators.partition.FastqPartition",
            "parallelism": "3145728"
        }
    },
    "pipelines": {
        "call": [
            "format",
            "p_pipe"
        ],
        "input": [
            "source",
            "partition",
            "format"
        ]
    }
}
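
The settings above cover only the sampleR1 input. Assuming the R2 mate is partitioned with an identical operator pipeline, so that read pairs stay aligned across partitions, the corresponding entry would differ only in its fully qualified name:

{
    "fqn": "Hic_Docker.Bowtie_Global_Mapping.sampleR2",
    "operators": {
        "format": {
            "class": "com.atgenomix.seqslab.piper.operators.format.FastqFormat"
        },
        "p_pipe": {
            "class": "com.atgenomix.seqslab.piper.operators.pipe.PPipe"
        },
        "source": {
            "class": "com.atgenomix.seqslab.piper.operators.source.FastqSource",
            "codec": "org.seqdoop.hadoop_bam.util.BGZFCodec"
        },
        "partition": {
            "class": "com.atgenomix.seqslab.piper.operators.partition.FastqPartition",
            "parallelism": "3145728"
        }
    },
    "pipelines": {
        "call": [
            "format",
            "p_pipe"
        ],
        "input": [
            "source",
            "partition",
            "format"
        ]
    }
}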

Data Consistency#

To validate data consistency between parallel and non-parallel execution, we compare the output of the Merge_Valid_Interaction task, which shows that the pair counts in every category are identical.

| Statistics | Non-Parallelization | Parallelization |
|---|---|---|
| valid_interaction | 52734597 | 52734597 |
| valid_interaction_rmdup | 47916253 | 47916253 |
| trans_interaction | 10867055 | 10867055 |
| cis_interaction | 37049198 | 37049198 |
| cis_shortRange | 3556197 | 3556197 |
| cis_longRange | 33493001 | 33493001 |

Performance Evaluation#

The test results show that applying the parallel settings greatly reduces execution time. The speedup is not fully linear, however, because parallelization adds some overhead and some tasks are not fully parallelizable.

It is worth noting that acceleration cannot be achieved by increasing the VM node count indefinitely: the performance with D13_v2 x 10 is very similar to that with D13_v2 x 20. This is because the FASTQ data is chunked into only 53 partitions. With D13_v2 x 10, a total of 80 CPU cores are available simultaneously, which already exceeds the 53 partitions, so allocating even more D13_v2 nodes yields no further speedup.

Note

An Azure D13_v2 VM provides 8 CPU cores and 56 GB of RAM.

| Performance (Time) | Non-Parallel Azure D13_v2 x 1 | Parallel Azure D13_v2 x 3 | Parallel Azure D13_v2 x 10 | Parallel Azure D13_v2 x 20 |
|---|---|---|---|---|
| Bowtie_Global_Mapping | 8h:01m:51s | 1h:24m:14s | 39m:52s | 37m:11s |
| Bowtie_Local_Trimming | 1m:58s | 1m:16s | 1m:25s | 1m:34s |
| Bowtie_Local_Mapping | 1h:27m:51s | 6m:50s | 3m:08s | 3m:19s |
| Mapping_Combine | 43m:42s | 3m:35s | 1m:55s | 1m:41s |
| Mapping_Stat | 10m:24s | 2m:21s | 1m:24s | 1m:16s |
| Merge_Pairs | 33m:48s | 4m:29s | 2m:37s | 1m:42s |
| Mapped_Hic_Fragments | 1h:51m:29s | 12m:00s | 4m:40s | 4m:37s |
| Merge_Valid_Interaction | 3m:28s | 4m:11s | 3m:47s | 3m:57s |
| Making_Plot | 1m:10s | 1m:27s | 1m:44s | 1m:21s |
| Build_Matrix | 1m:37s | 2m:00s | 2m:31s | 1m:41s |
| Ice_Normalization | 2m:16s | 3m:22s | 2m:54s | 3m:04s |
| Total | 13h:13m:06s | 2h:04m:22s | 1h:04m:06s | 1h:01m:08s |