Toil

Details on the Toil engine (CWL mode) deployed by Amazon Genomics CLI

Description

Toil is a workflow engine developed by the Computational Genomics Lab at the UC Santa Cruz Genomics Institute. In Amazon Genomics CLI, Toil can be deployed in a context as an engine to run workflows written in the Common Workflow Language (CWL) standard, versions v1.0, v1.1, and v1.2 (or a mix of these versions).

Toil is an open source project distributed by UC Santa Cruz under the Apache 2 license and available on GitHub.

Architecture

A Toil engine deployed in an Amazon Genomics CLI context has two components, the Toil server and the task compute environment:

Image of infrastructure deployed in a Toil context

Toil Server

The Toil engine runs in “server mode” as a container service in Amazon ECS. The engine can run multiple workflows asynchronously. Workflow tasks run in an elastic compute environment and are monitored by Toil. Amazon Genomics CLI communicates with the Toil engine through the GA4GH Workflow Execution Service (WES) REST API that the server offers, which is exposed through API Gateway.
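To make the WES interaction more concrete, the following is a minimal Python sketch of how a client might build a run-status request and read the reply. The `runs/{run_id}/status` path and the `state` field come from the GA4GH WES specification. The base URL, the `prod` stage, and the helper names (`wes_url`, `build_status_request`, `parse_run_state`) are hypothetical and chosen only for illustration. The sketch does not send the request, and it omits whatever authentication the API Gateway endpoint requires.

```python
import json
import urllib.request

# Hypothetical endpoint (assumption): substitute the API Gateway URL of your own context.
WES_BASE_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/ga4gh/wes/v1"


def wes_url(base_url: str, *parts: str) -> str:
    """Join a WES base URL with path segments, avoiding a doubled slash."""
    return "/".join([base_url.rstrip("/"), *parts])


def build_status_request(base_url: str, run_id: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the status of one workflow run."""
    return urllib.request.Request(
        wes_url(base_url, "runs", run_id, "status"),
        method="GET",
        headers={"Accept": "application/json"},
    )


def parse_run_state(body: str) -> str:
    """Extract the `state` field from a WES RunStatus JSON document."""
    return json.loads(body)["state"]


if __name__ == "__main__":
    # "abc123" is a made-up run ID used only to show the resulting URL.
    request = build_status_request(WES_BASE_URL, "abc123")
    print(request.get_method(), request.full_url)
```

In normal use Amazon Genomics CLI makes these calls on your behalf, so a sketch like this is mainly useful for understanding what travels between the CLI and the Toil server.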

Task Compute Environment

Workflow tasks are submitted by Toil to an AWS Batch queue and run in Toil-provided containers in an AWS Batch compute environment. Tasks that use the CWL DockerRequirement are additionally run in sibling containers on the host Docker daemon. AWS Batch coordinates the elastic provisioning of EC2 instances (container hosts) based on the available work in the queue. Batch places containers on container hosts as space allows.
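As an illustration of a task that would take the sibling-container path, the sketch below builds a minimal CWL v1.2 `CommandLineTool` that declares a `DockerRequirement`. It is written as a Python dictionary and serialized to JSON, which CWL accepts because JSON is a subset of YAML. The function name `docker_tool`, the `ubuntu:22.04` image, and the `echo` command are illustrative assumptions, not part of Amazon Genomics CLI or Toil.

```python
import json


def docker_tool(image: str) -> dict:
    """Return a minimal CWL v1.2 CommandLineTool that runs `echo` inside `image`."""
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": "echo",
        # DockerRequirement asks the runner to execute the command in this image.
        "requirements": {"DockerRequirement": {"dockerPull": image}},
        "inputs": {
            "message": {"type": "string", "inputBinding": {"position": 1}},
        },
        # Capture standard output in out.txt and expose it as the tool's output.
        "outputs": {"out": {"type": "stdout"}},
        "stdout": "out.txt",
    }


if __name__ == "__main__":
    # Example image name (assumption); any image available to the host works.
    print(json.dumps(docker_tool("ubuntu:22.04"), indent=2))
```

Saving the printed JSON to a file such as `echo.cwl` gives a tool description that a CWL runner can load directly.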

Disk Expansion

Container hosts in the Batch compute environment use EBS volumes as local scratch space. As an EBS volume approaches a capacity threshold, new EBS volumes will be attached and merged into the file system. These volumes are destroyed when AWS Batch terminates the container host. CWL disk space requirements are ignored by Toil when running against AWS Batch.

As a result, workflows that succeed on Amazon Genomics CLI (AGC) may fail on other CWL runners (because they do not request enough disk space), and workflows that succeed on other CWL runners may fail on AGC (because they consume disk space faster than the expansion process can react).
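One way to keep a workflow portable to other CWL runners is to declare disk needs explicitly, even though Toil ignores them when running against AWS Batch. The sketch below adds `tmpdirMin` and `outdirMin` (fields of the CWL `ResourceRequirement`, expressed in mebibytes) to a tool description held as a Python dictionary. The helper name `with_disk_request`, the sample tool, and the sizes are hypothetical, and the helper assumes the tool uses the map form of `requirements` rather than the list form.

```python
import copy
import json


def with_disk_request(tool: dict, tmpdir_mib: int, outdir_mib: int) -> dict:
    """Return a copy of a CWL tool with explicit minimum disk sizes in MiB.

    Assumes `requirements`, if present, is in map form (keyed by class name).
    """
    if tmpdir_mib < 0 or outdir_mib < 0:
        raise ValueError("disk sizes must be non-negative")
    updated = copy.deepcopy(tool)  # leave the caller's dictionary untouched
    requirements = updated.setdefault("requirements", {})
    resources = requirements.setdefault("ResourceRequirement", {})
    resources["tmpdirMin"] = tmpdir_mib
    resources["outdirMin"] = outdir_mib
    return updated


if __name__ == "__main__":
    # A made-up tool used only to demonstrate the helper.
    sample_tool = {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": "sort",
        "inputs": {"data": {"type": "File", "inputBinding": {"position": 1}}},
        "outputs": {"sorted": {"type": "stdout"}},
        "stdout": "sorted.txt",
    }
    # Request 10 GiB of temporary space and 5 GiB of output space (example values).
    print(json.dumps(with_disk_request(sample_tool, 10 * 1024, 5 * 1024), indent=2))
```

Declaring these values does not change behavior on AGC, but it gives runners that honor disk requests the information they need to schedule the task.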