Job lifecycle and run exit codes#

When you submit a workflow execution service request (job) through the API, CLI, or Console, SeqsLab runs the job in automated or interactive mode and reports the run status throughout the job lifecycle. You can monitor the run status and, when job execution fails, use the exit code to pinpoint the cause of the error.

Run states#

A run can be in one of the following states, depending on whether it is an automated job or an interactive job:

Automated jobs#

| State | Description |
| --- | --- |
| UNKNOWN | The job has been created in SeqsLab, but no backend instance has been requested. |
| QUEUED | The request for a backend instance has been dispatched, and the job is awaiting the provisioning of backend resources. |
| INITIALIZING | SeqsLab is initializing to process the automated job. |
| RUNNING | SeqsLab is handling the tasks of the WDL supplied by the Tool Registry Service (TRS). |
| COMPLETE | The job has been executed successfully. |
| EXECUTOR_ERROR | An error occurred while preparing environments or executing commands written in Workflow Description Language (WDL). |
| SYSTEM_ERROR | An error unrelated to WDL commands occurred during job execution. Refer to the exit codes for additional information. |
| CANCELING | Job cancellation has been triggered by either the user or SeqsLab itself, and SeqsLab is removing all associated resources. |
| CANCELED | The job cancellation request has completed successfully. |

Note

States such as PAUSED are not yet implemented for SeqsLab automated jobs.

Interactive jobs#

| State | Description |
| --- | --- |
| UNKNOWN | The job has been created in SeqsLab, but no backend instance has been requested. |
| QUEUED | The request for a backend instance has been dispatched, and the job is awaiting the provisioning of backend resources. |
| INITIALIZING | SeqsLab is provisioning the requested cluster resources, including the SQL endpoint and the JupyterLab endpoint. |
| RUNNING | The interactive job is ready to serve requests. |
| COMPLETE | The interactive job has terminated completely. |
| SYSTEM_ERROR | An error related to provisioning the interactive job occurred. Refer to the exit codes for additional information. |
| CANCELING | Job cancellation has been triggered by either the user or a scheduled job, and SeqsLab is removing all associated resources. |

Note

States such as EXECUTOR_ERROR, CANCELED, and PAUSED are not yet implemented for SeqsLab interactive jobs.
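The two state lists above can be captured in a small lookup, which is handy when client code needs to reject a state that a given job type never reports. The following is a minimal sketch, not part of SeqsLab: the `RunState` enum, the `"automated"` and `"interactive"` keys, and the `is_implemented` helper are hypothetical names chosen for illustration.

```python
from enum import Enum


class RunState(str, Enum):
    """Run states listed in this document (PAUSED is omitted: not implemented)."""
    UNKNOWN = "UNKNOWN"
    QUEUED = "QUEUED"
    INITIALIZING = "INITIALIZING"
    RUNNING = "RUNNING"
    COMPLETE = "COMPLETE"
    EXECUTOR_ERROR = "EXECUTOR_ERROR"
    SYSTEM_ERROR = "SYSTEM_ERROR"
    CANCELING = "CANCELING"
    CANCELED = "CANCELED"


# Hypothetical job-type keys; the state sets mirror the two lists above.
IMPLEMENTED_STATES = {
    "automated": frozenset(RunState),
    "interactive": frozenset(RunState)
    - {RunState.EXECUTOR_ERROR, RunState.CANCELED},
}


def is_implemented(job_type: str, state: RunState) -> bool:
    """Return True if the job type can report the state (KeyError on unknown type)."""
    return state in IMPLEMENTED_STATES[job_type]
```

For example, `is_implemented("interactive", RunState.CANCELED)` returns `False`, while the same call with `"automated"` returns `True`.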

Exit codes#

Exit codes, specifically error codes, give detailed information about the cause of errors in a job run. During the job's lifecycle, only three states (CANCELED, EXECUTOR_ERROR, and SYSTEM_ERROR) generate an exit code. Other states, such as COMPLETE or RUNNING, have no corresponding exit codes. The exit codes that you may encounter after a job run are listed below:

Automated jobs#

Canceled#

SeqsLab allows users to manually terminate a submitted job, for example when the job request specification is incorrect. It also automatically cancels a submitted job when the requested resources are insufficient for job execution. Depending on the task size within an automated job, we recommend adjusting the wm_size to prevent out-of-memory errors in the workflow execution engine and the unexpected backend resource usage they cause.

Note

For automated jobs with fewer than 500 tasks, it is recommended to configure them with Standard_D1_v2 as the wm_size. For jobs that involve more than 500 tasks, it is advisable to carefully evaluate and adjust the wm_size accordingly.
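The sizing guidance in this note can be expressed as a small helper. This is only a sketch of the stated rule: `recommend_wm_size` is a hypothetical function, not a SeqsLab API, and because the note does not say what to do at exactly 500 tasks, the sketch assumes that case also needs manual evaluation.

```python
from typing import Optional

SMALL_JOB_TASK_LIMIT = 500            # threshold stated in the note
SMALL_JOB_WM_SIZE = "Standard_D1_v2"  # size recommended below the threshold


def recommend_wm_size(task_count: int) -> Optional[str]:
    """Return the recommended wm_size, or None if it must be evaluated manually."""
    if task_count < 0:
        raise ValueError("task_count must be non-negative")
    if task_count < SMALL_JOB_TASK_LIMIT:
        return SMALL_JOB_WM_SIZE
    # 500 or more tasks: no fixed recommendation is given (the exact-500
    # case is an assumption), so the caller should choose a size deliberately.
    return None
```

A `None` result signals that the job is large enough for wm_size to be chosen case by case rather than defaulted.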

| Exit Code | Description | Action | Error Message Example |
| --- | --- | --- | --- |
| 14 | User-initiated cancellation of job. | Initiate a new job using a different configuration. | Job has been aborted by user. |
| 15 | Automatic job cancellation to avoid unexpected consumption. | Submit a new job with an increased wm_size. | To avoid unexpected backend consumption, job has been automatically aborted. Please adjust larger wm_size. |

Note

SeqsLab provides a Smart Job Reuse feature that allows jobs to resume from the point of interruption or modification.

Executor error#

SeqsLab utilizes its kernel to execute each task described in WDL. To pinpoint the root cause of an Executor Error, you can perform debugging using the kernel exit code.

| Exit Code | Description | Action | Error Message Example |
| --- | --- | --- | --- |
| 18 | Failed to execute the command of a WDL task; the kernel exit code is provided. | Check the execution log and the kernel exit code. | Task fails with kernel exit code: 1 |

You can review detailed error messages in the stdout, stderr, and driver logs of failed tasks through the Activity section of the SeqsLab Console. Alternatively, to identify the root cause quickly, refer to the kernel exit codes listed in the table below.

| Kernel Exit Code | Description | Action | Error Message Example |
| --- | --- | --- | --- |
| 1 | Configuration error, such as unequal data partition numbers. | Review the WDL and the job configuration based on the detail message. | assert failed: detail message. |
| 2 | Command failed. | Check the detail message in the stdout, stderr, and driver log of the failed task. | Error: Command failed at ... |
| 3 | Dataset error, such as a problem with access methods or a missing dataset. | Review the DrsObject to see whether it is registered correctly. | Error: error in DRS XXX for MainWDL.SubWorkflow.Task.input. |
| 4 | Dataset integrity check failed. | Review the DrsObject to see whether its size and checksums are correct. | Error: Invalid DRS object checksum. |
| 5 | Missing output. | Make sure that the output is generated inside the WDL task. | Error: Missing Output at ... |
| 6 | Operator configuration error, such as operator not found. | Review your Acceleration settings. | Error: Operator failed. |
| 7 | Other kernel error. | Please contact SeqsLab's support team. | - |
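When triaging many failed tasks, the kernel exit code can be pulled out of the error message and matched against the kernel exit code list above. The sketch below assumes the message always follows the single documented example, "Task fails with kernel exit code: 1"; the function names and the dictionary are hypothetical and not part of SeqsLab.

```python
import re
from typing import Optional

# Short labels taken from the kernel exit code descriptions above.
KERNEL_EXIT_CODES = {
    1: "Configuration error",
    2: "Command failed",
    3: "Dataset error",
    4: "Dataset integrity check failed",
    5: "Missing output",
    6: "Operator configuration error",
    7: "Other kernel error",
}

# Assumed message shape, based on the one documented example.
_KERNEL_CODE_RE = re.compile(r"kernel exit code:\s*(\d+)")


def parse_kernel_exit_code(message: str) -> Optional[int]:
    """Extract the kernel exit code from an error message, if present."""
    match = _KERNEL_CODE_RE.search(message)
    return int(match.group(1)) if match else None


def describe_kernel_failure(message: str) -> str:
    """Map an error message to a short description of the kernel exit code."""
    code = parse_kernel_exit_code(message)
    if code is None:
        return "No kernel exit code found in message"
    label = KERNEL_EXIT_CODES.get(code, "not listed")
    return f"Kernel exit code {code}: {label}"
```

Calling `describe_kernel_failure("Task fails with kernel exit code: 1")` yields `"Kernel exit code 1: Configuration error"`; codes outside 1 to 7 are reported as not listed rather than guessed.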

System error#

A System error can be generated by either job type: an interactive job or an automated job.

| Exit Code | Description | Action | Error Message Example |
| --- | --- | --- | --- |
| 9 | The digests of docker images from the registry do not match the digests registered in the Tool Registry Service (TRS). | Verify the tool image digests recorded in TRS. | Integrity check of docker images fails. |
| 10 | Unable to add output fqn at the main workflow level. | Please contact SeqsLab's support team. | Failed to add outer fqns: MainWDL.SubWorkflow.Task |
| 11 | Invalid input JSON for the Workflow Execution Engine. | Review the correctness of InputConnection in this job. | Required workflow input 'RNAseq.refTptAnnotateGtf' not specified. |
| 12 | Failed to materialize the WDL at the initializing stage. | Review the correctness of the WDL in this job. | Failed to materialize: xxx |
| 13 | Failed to register task outputs to DRS. | Make sure that the run name contains no special characters. | Unsuccessful registration of DRS object. |
| 16 | The cluster became unusable while executing the command of a WDL task, for reasons such as running out of disk. | Check this link for more info. Apply a larger runtime option for the next run. | Cluster becomes unusable due to reasons such as running out of disk. |
| 17 | Error message sent from the workflow execution engine. | Check the error message for more details and contact SeqsLab's support team. | java.lang.IllegalArgumentException: Docker image has an invalid syntax. |
| 20 | The cluster cannot be provisioned. This is usually caused by docker image errors. | Make sure the docker image is correct. Check this link for more reasons why nodes enter the starttaskfailed and unusable states. | All nodes of pool test-xnwnlakckme become starttaskfailed/unusable. Shutdown cluster! |
| 21 | The master node of the provisioned cluster went into the starttaskfailed or unusable state. | Make sure the docker image is correct. Check this link for more reasons why nodes enter the starttaskfailed and unusable states. | Master of pool test-xnwnlakckme becomes starttaskfailed/unusable. Shutdown cluster! |
| 22 | Azure Batch cannot be provisioned at the query level. | Make sure the workspace is created by following this link. | To be added. |
| 23 | The Azure Batch job quota has been reached: too many jobs and job schedules have been submitted to Azure Batch. | Make sure the workspace is created by following this link. | Active job and job schedule quota for the account has been reached. |
| 24 | Five retries at the task level have been reached. | Please contact SeqsLab's support team. | To be added. |
| 25 | An error in the RunRequest configuration causes the /wes/v1/runs/{run_id}/files/ API to fail to return a zip file. | Adjust the InputConnection of this job based on the returned error messages. | detail:drs object with label test/SomaticAnalysis/inFileFqsTumor/1/2 not found,code:internal_server_error |
| 26 | MaxTaskRetryCount has already been set to 0. This is an Azure Batch internal retry due to a recovery operation. | Apply Smart Reuse to resubmit the job. Otherwise, please contact SeqsLab's support team. | Unexpected retry. |
| 27 | The Spark cluster was shut down for an unexpected reason. | Apply Smart Reuse to resubmit the job. Otherwise, please contact SeqsLab's support team. | Cluster disappears unexpectedly. |
| 28 | Unable to decompress the zip file downloaded from the WES files API. | Apply Smart Reuse to resubmit the job. Otherwise, please contact SeqsLab's support team. | Unable to unzip run_id.zip file. |
| 29 | The Workflow Execution Engine did not terminate with a correct exit code. This is usually caused by errors from the Java Runtime Environment. | Apply Smart Reuse to resubmit the job. Otherwise, please contact SeqsLab's support team. | WE2 is terminated with exit code 134. |

Note

For additional debugging information, please refer to the “driver.log” file accessible via the SeqsLab Console.
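The recommended actions for automated-job System error exit codes fall into three broad groups, so a first-pass triage can be automated. The grouping below is one reading of the Action column above, and the category strings and the `triage_system_error` function are hypothetical names for illustration, not SeqsLab features.

```python
# Codes whose documented action is to resubmit with Smart Reuse first.
RESUBMIT_WITH_SMART_REUSE = frozenset({26, 27, 28, 29})

# Codes whose documented action is to contact SeqsLab's support team.
CONTACT_SUPPORT = frozenset({10, 17, 24})

# Codes whose documented action is to review the job, its images or inputs,
# the runtime options, or the workspace setup.
REVIEW_JOB_OR_WORKSPACE = frozenset({9, 11, 12, 13, 16, 20, 21, 22, 23, 25})


def triage_system_error(exit_code: int) -> str:
    """Return a coarse next-step category for an automated-job System error."""
    if exit_code in RESUBMIT_WITH_SMART_REUSE:
        return "resubmit_with_smart_reuse"
    if exit_code in CONTACT_SUPPORT:
        return "contact_support"
    if exit_code in REVIEW_JOB_OR_WORKSPACE:
        return "review_job_or_workspace"
    # Codes not listed for automated-job System errors are not classified.
    return "unknown"
```

For instance, `triage_system_error(27)` returns `"resubmit_with_smart_reuse"`, while an exit code outside the list returns `"unknown"` so that it is escalated rather than silently classified.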

Interactive jobs#

System error#

| Exit Code | Description | Action | Error Message Example |
| --- | --- | --- | --- |
| 50 | Unable to launch the JupyterLab and Spark Thrift Server services inside the interactive cluster. | Please contact SeqsLab's support team. | Unable to initiate lab and sql endpoint. |
| 51 | The cluster cannot be provisioned because of issues such as docker image errors. | Make sure the docker image is correct. | Unable to provision clusters. |
| 52 | Azure Batch cannot be provisioned through aztk. | Please contact SeqsLab's support team. | Error launching cluster in workspace with exit code: 1 |

Note

For SeqsLab versions earlier than 2023-Nov, the System error exit codes for interactive jobs differ from those in the table above.