The following pages provide details on the workflow engines that are currently supported by Amazon Genomics CLI.
Workflow Engines
1 - Filesystems
The tasks in a workflow require a common filesystem or scratch space where the outputs of tasks can be written so they are available to the inputs of dependent tasks in the same workflow. The following pages provide details on the engine filesystems that can be deployed by Amazon Genomics CLI.
1.1 - EFS Workflow Filesystem
Amazon EFS Workflow Filesystem
Workflow engines that support it may use Amazon EFS as a shared “scratch” space for hosting workflow intermediates and outputs. Initial inputs are localized once from S3 and final outputs are written back to S3 when the workflow is complete. All intermediate I/O is performed against the EFS filesystem.
Advantages
- Compared with the S3 Filesystem there is no redundant I/O of inputs from S3.
- Each task’s individual I/O operations tend to be smaller than a full copy from S3, so there is less network congestion on the container host.
- Option to use provisioned IOPs to provide high sustained throughput.
- The volume is elastic and will expand and contract as needed.
- It is simple to start an Amazon EC2 instance from the AWS console and connect it to the EFS volume to view outputs as they are created. This can be useful for debugging a workflow.
Disadvantages
- Amazon EFS volumes are more expensive than storing intermediates and output in S3, especially when the volume uses provisioned IOPs.
- The volume exists for the lifetime of the context and will incur costs based on its size for that entire period. If you no longer need the context we recommend destroying it.
- Call caching is only possible for as long as the volume exists, i.e. the lifetime of the context.
Provisioned IOPs
Amazon EFS volumes deployed by the Amazon Genomics CLI use “bursting” throughput by default. For workflows with high I/O throughput, or in scenarios where many workflows run in the same context at the same time, you may exhaust the burst credits of the volume. This might cause a workflow to slow down or even fail. Available volume credits can be monitored in the Amazon EFS console and/or Amazon CloudWatch. If you observe the exhaustion of burst credits you may want to consider deploying a context with provisioned throughput.
The following fragment of an agc-project.yaml file is an example of how to configure provisioned throughput for the Amazon EFS volume used by miniwdl in an Amazon Genomics CLI context:

```yaml
myContext:
  engines:
    - type: wdl
      engine: miniwdl
      filesystem:
        fsType: EFS
        configuration:
          provisionedThroughput: 1024
```
Supporting Engines
The use of Amazon EFS as a shared file system is supported by the miniwdl and Snakemake engines. Both use EFS with bursting throughput by default and both support provisioned IOPs.
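As a sketch, a Snakemake engine could be configured with provisioned throughput in the same way as the miniwdl example above; the context name and throughput value here are illustrative assumptions:

```yaml
myContext:
  engines:
    - type: snakemake
      engine: snakemake
      filesystem:
        fsType: EFS
        configuration:
          provisionedThroughput: 1024
```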
1.2 - S3 Workflow Filesystem
Amazon S3 Workflow Filesystem
Some workflow engines deployed by Amazon Genomics CLI can use S3 as their shared “filesystem”. Because S3 is not a POSIX compliant filesystem and most of the applications run by workflow tasks will require POSIX files, inputs will be localized from Amazon S3 and outputs will be delocalized to Amazon S3.
Advantages
- Inputs are read into each task’s container and are not available via a common container mount, so there is no possibility of containers on the same host over-writing or accessing another task’s inputs.
- No shared file system needs to be provisioned for a context’s compute environment, thereby reducing ongoing costs.
- All intermediate task outputs and all workflow outputs are persisted to the S3 bucket provisioned by Amazon Genomics CLI. This bucket remains after contexts are destroyed, and even after Amazon Genomics CLI is deactivated in the account.
- Container hosts use an auto-expansion strategy for their local EBS volumes so disk sizes don’t need to be stated.
Disadvantages
- Container hosts running multiple tasks may exhaust their aggregate network bandwidth (see below).
- It is assumed that no other external process will be making changes to the S3 objects during a workflow run. If this does happen, the run may fail or be corrupted.
Network Bandwidth Considerations
During workflows with large numbers of concurrent steps that all rely on large inputs you may observe that the localization of inputs to the containers will become very slow. This is because a single EC2 container host may have multiple containers all competing for limited bandwidth. In these cases we recommend the following possible mitigations:
- Consider using a shared filesystem such as EFS for your engine or an engine that supports EFS
- Configure your agc-project.yaml so that a context is available that uses network-optimized instance types. For example, use m5n instance types rather than m5, and use instance types that offer sustained throughput rather than burst throughput, such as instances with more than 16 vCPUs.
- Consider modifying your workflow to request larger memory and vCPU amounts for these tasks. This will tend to make AWS Batch select larger instances with better network performance, as well as place fewer containers per host, resulting in less competition for bandwidth.
These mitigations may result in the use of more expensive infrastructure but can ultimately save money by completing the workflow quicker. The best price-performance configuration will vary by workflow.
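For example, a context restricted to network-optimized instance types might be declared as follows. This is a sketch only; the context name and the specific instance types are illustrative assumptions, not recommendations:

```yaml
contexts:
  networkOptimizedCtx:
    instanceTypes:
      - m5n.8xlarge
      - m5n.16xlarge
    engines:
      - type: nextflow
        engine: nextflow
```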
Supporting Engines
The Cromwell and Nextflow engines both support the use of Amazon S3 as a filesystem. Contexts using these engines will use this filesystem by default.
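As a minimal sketch, a context using Nextflow (and therefore the S3 filesystem by default, with no filesystem block required) could be declared like this; the context name is illustrative:

```yaml
contexts:
  spotCtx:
    requestSpotInstances: true
    engines:
      - type: nextflow
        engine: nextflow
```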
2 - miniwdl
Description
miniwdl is free open source software distributed under the MIT licence developed by the Chan Zuckerberg Initiative.
The source code for miniwdl is available on GitHub. When deployed with Amazon Genomics CLI miniwdl makes use of the miniwdl-aws extension which is also distributed under the MIT licence.
Architecture
There are four components of a miniwdl engine as deployed in an Amazon Genomics CLI context:
WES Adapter
Amazon Genomics CLI communicates with the miniwdl engine via a GA4GH WES REST service. The WES Adapter implements the WES standard and translates WES calls into calls to the miniwdl head process.
Head Compute Environment
For every workflow submitted, the WES adapter will create a new AWS Batch Job that contains the miniwdl process responsible for running that workflow. These miniwdl “head” jobs are run in an “On-demand” AWS Fargate compute environment even when the actual workflow tasks run in a Spot environment. This is to prevent Spot interruptions from terminating the workflow coordinator.
Task Compute Environment
Workflow tasks are submitted by the miniwdl head job to an AWS Batch queue and run in containers using an AWS Compute Environment. Container characteristics are defined by the resources requested in the workflow configuration. AWS Batch coordinates the elastic provisioning of EC2 instances (container hosts) based on the available work in the queue. Batch will place containers on container hosts as space allows.
Session Cache and Input Localization
Any context with a miniwdl engine will use an Amazon Elastic File System (EFS) volume as scratch space. Inputs from S3 are localized to the volume by jobs that the miniwdl engine spawns to copy these files to the volume. Outputs are copied back to S3 using a similar process. Workflow tasks access the EFS volume to obtain inputs and write intermediates and outputs.
The EFS volume is used by all miniwdl engine “head” jobs to store metadata necessary for call caching.
The EFS volume will remain in your account for the lifetime of the context and is destroyed when the context is destroyed. Because the volume will grow as you run more workflows, we recommend destroying the context when done to avoid ongoing EFS charges.
Using miniwdl as a Context Engine
You may declare miniwdl to be the engine for any context’s wdl type engine. For example:

```yaml
contexts:
  onDemandCtx:
    requestSpotInstances: false
    engines:
      - type: wdl
        engine: miniwdl
```
Call Caching
Call caching is enabled by default for miniwdl, and because the metadata is stored in the context’s EFS volume, call caching works across different engine “head” jobs.
To disable call caching you can provide the --no-cache engine option. You may do this in a workflow’s MANIFEST.json by adding the following key/value pair:
"engineOptions": "--no-cache"
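For context, a complete MANIFEST.json using this option might look like the following sketch; the mainWorkflowURL and inputFileURLs values are illustrative assumptions for a hypothetical workflow:

```json
{
  "mainWorkflowURL": "main.wdl",
  "inputFileURLs": [
    "inputs.json"
  ],
  "engineOptions": "--no-cache"
}
```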
3 - Toil
Description
Toil is a workflow engine developed by the Computational Genomics Lab at the UC Santa Cruz Genomics Institute. In Amazon Genomics CLI, Toil can be deployed in a context as an engine to run workflows written in the Common Workflow Language (CWL) standard, versions v1.0, v1.1, and v1.2 (or mixed versions).
Toil is an open source project distributed by UC Santa Cruz under the Apache 2 license and available on GitHub.
Architecture
There are two components of a Toil engine as deployed in an Amazon Genomics CLI context:
Toil Server
The Toil engine is run in “server mode” as a container service in ECS. The engine can run multiple workflows asynchronously. Workflow tasks are run in an elastic compute environment and monitored by Toil. Amazon Genomics CLI communicates with the Toil engine via a GA4GH WES REST service which the server offers, available via API Gateway.
Task Compute Environment
Workflow tasks are submitted by Toil to an AWS Batch queue and run in Toil-provided containers using an AWS Compute Environment. Tasks which use the CWL DockerRequirement will additionally be run in sibling containers on the host Docker daemon. AWS Batch coordinates the elastic provisioning of EC2 instances (container hosts) based on the available work in the queue. Batch will place containers on container hosts as space allows.
Disk Expansion
Container hosts in the Batch compute environment use EBS volumes as local scratch space. As an EBS volume approaches a capacity threshold, new EBS volumes will be attached and merged into the file system. These volumes are destroyed when AWS Batch terminates the container host. CWL disk space requirements are ignored by Toil when running against AWS Batch.
This setup means that workflows that succeed on AGC may fail on other CWL runners (because they do not request enough disk space) and workflows that succeed on other CWL runners may fail on AGC (because they allocate disk space faster than the expansion process can react).
4 - Cromwell
Description
Cromwell is a workflow engine developed by the Broad Institute. In Amazon Genomics CLI, Cromwell can be deployed in a context as an engine to run workflows based on the WDL specification.
Cromwell is an open source project distributed by the Broad Institute under the Apache 2 license and available on GitHub.
Customizations
Some minor customizations were made to the AWS backend adapter for Cromwell to facilitate improved scalability and cross-region S3 bucket access when deployed with Amazon Genomics CLI. The fork containing these customizations is available here, and we are working to contribute them back to the main code base.
Architecture
There are four components of a Cromwell engine as deployed in an Amazon Genomics CLI context.
WES Adapter
Amazon Genomics CLI communicates with the Cromwell engine via a GA4GH WES REST service. The WES Adapter implements the WES standard and translates WES calls into calls to the Cromwell REST API. The adapter runs as an Amazon ECS service available via API Gateway.
Cromwell Server
The Cromwell engine is run in “server mode” as a container service in ECS and receives instructions from the WES Adapter. The engine can run multiple workflows asynchronously. Workflow tasks are run in an elastic compute environment and monitored by Cromwell.
Session Cache
Cromwell can use workflow run metadata to perform call caching. When deployed by Amazon Genomics CLI, call caching is enabled by default. Metadata is stored by an embedded HSQL DB with file storage in an attached EFS volume. The EFS volume exists for the lifetime of the context the engine is deployed in, so re-runs of workflows within that lifetime can benefit from call caching.
Task Compute Environment
Workflow tasks are submitted by Cromwell to an AWS Batch queue and run in containers using an AWS Compute Environment.
Container characteristics are defined by the task runtime attributes. AWS Batch coordinates the elastic provisioning of EC2 instances (container hosts) based on the available work in the queue. Batch will place containers on container hosts as space allows.
Fetch and Run Strategy
Execution of workflow tasks uses a “Fetch and Run” strategy. The commands specified in the command section of the WDL task are written as a file to S3, “fetched” into the container, and run. The script is “decorated” with instructions to fetch any File inputs from S3 and to write any File outputs back to S3.
Disk Expansion
Container hosts in the Batch compute environment use EBS volumes as local scratch space. As an EBS volume approaches a capacity threshold, new EBS volumes will be attached and merged into the file system. These volumes are destroyed when AWS Batch terminates the container host. For this reason it is not necessary to specify disk requirements in the task runtime, and these WDL directives will be ignored.
AWS Batch Retries
The Cromwell AWS Batch backend supports AWS Batch’s task retry option, allowing failed tasks to attempt to run again. This can be useful for adding resilience to a workflow against sporadic infrastructure failures. It is especially useful when using an Amazon Genomics CLI “spot” context, as Spot instances can be terminated with minimal warning. To enable retries, add the following option to the runtime section of a task:

```wdl
runtime {
  ...
  awsBatchRetryAttempts: <int>
  ...
}
```

where <int> is an integer specifying the number of retries, up to a maximum of 10.
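As an illustrative sketch, a hypothetical task using this option might look like the following; the task name, command, and container image are assumptions, not part of the Cromwell documentation:

```wdl
task count_lines {
  input {
    File infile
  }
  command {
    # count lines in the input file; retried by AWS Batch on failure
    wc -l < ~{infile} > line_count.txt
  }
  runtime {
    docker: "ubuntu:20.04"
    awsBatchRetryAttempts: 3
  }
  output {
    Int count = read_int("line_count.txt")
  }
}
```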
Although similar to the WDL preemptible option, awsBatchRetryAttempts differs in how retries are implemented. Notably, the implementation falls back on the AWS Batch retry strategy and will retry a task that fails for any reason, whereas the preemptible option is specific to failures caused by preemption. At this time the preemptible option is not supported by Amazon Genomics CLI and is ignored.
5 - Nextflow
Description
Nextflow is free open source software distributed under the Apache 2.0 licence developed by Seqera Labs. The project was started in the Notredame Lab at the Centre for Genomic Regulation (CRG).
The source code for Nextflow is available on GitHub.
Architecture
There are four components of a Nextflow engine as deployed in an Amazon Genomics CLI context:
WES Adapter
Amazon Genomics CLI communicates with the Nextflow engine via a GA4GH WES REST service. The WES Adapter implements the WES standard and translates WES calls into calls to the Nextflow head process.
Head Compute Environment
For every workflow submitted, the WES adapter will create a new AWS Batch Job that contains the Nextflow process responsible for running that workflow. These Nextflow “head” jobs are run in an “On-demand” compute environment even when the actual workflow tasks run in a Spot environment. This is to prevent Spot interruptions from terminating the workflow coordinator.
Task Compute Environment
Workflow tasks are submitted by the Nextflow head job to an AWS Batch queue and run in containers using an AWS Compute Environment. Container characteristics are defined by the resources requested in the workflow configuration. AWS Batch coordinates the elastic provisioning of EC2 instances (container hosts) based on the available work in the queue. Batch will place containers on container hosts as space allows.
Fetch and Run Strategy
Execution of workflow tasks uses a “Fetch and Run” strategy. Input files required by a workflow task are fetched from S3 into the task container. Output files are copied out of the container to S3.
Disk Expansion
Container hosts in the Batch compute environment use EBS volumes as local scratch space. As an EBS volume approaches a capacity threshold, new EBS volumes will be attached and merged into the file system. These volumes are destroyed when AWS Batch terminates the container host.
6 - Snakemake
Description
Snakemake is free open source software distributed under the MIT licence developed by Johannes Köster and their team.
The source code for Snakemake is available on GitHub. When deployed with Amazon Genomics CLI, Snakemake uses AWS Batch to distribute the underlying tasks.
Architecture
There are four components of a snakemake engine as deployed in an Amazon Genomics CLI context:
WES Adapter
Amazon Genomics CLI communicates with the snakemake engine via a GA4GH WES REST service. The WES Adapter implements the WES standard and translates WES calls into calls to the snakemake head process.
Head Compute Environment
For every workflow submitted, the WES adapter will create a new AWS Batch Job that contains the snakemake process responsible for running that workflow. These snakemake “head” jobs are run in an “On-demand” AWS Fargate compute environment even when the actual workflow tasks run in a Spot environment. This is to prevent Spot interruptions from terminating the workflow coordinator.
Task Compute Environment
Workflow tasks are submitted by the snakemake head job to an AWS Batch queue and run in containers using an AWS Compute Environment. Container characteristics are defined by the resources requested in the workflow configuration. AWS Batch coordinates the elastic provisioning of EC2 instances (container hosts) based on the available work in the queue. Batch will place containers on container hosts as space allows.
Session Cache and Input Localization
Any context with a snakemake engine will use an Amazon Elastic File System (EFS) volume as scratch space. Inputs from the workflow are localized to the volume by jobs that the snakemake engine spawns to copy these files to the volume. Outputs are copied back to S3 after the workflow is complete. Workflow tasks access the EFS volume to obtain inputs and write intermediates and outputs.
The EFS volume can be used by all Snakemake engine “head” jobs to store metadata necessary for dependency caching by specifying a Conda workspace argument that is common across all executions, for example --conda-prefix /mnt/efs/snakemake/conda.
The EFS volume will remain in your account for the lifetime of the context and is destroyed when the context is destroyed. Because the volume will grow as you run more workflows, we recommend destroying the context when done to avoid ongoing EFS charges.
Using Snakemake as a Context Engine
You may declare snakemake to be the engine for any context’s snakemake type engine. For example:

```yaml
contexts:
  onDemandCtx:
    requestSpotInstances: false
    engines:
      - type: snakemake
        engine: snakemake
```
Conda Dependency Caching
Dependency caching is disabled by default so that each workflow run is independent. If you would like workflow runs to re-use the Conda cache, specify a folder under /mnt/efs, which is where the EFS storage space is attached. This enables Snakemake to re-use dependencies, which will decrease the time that subsequent workflow runs take.
To enable dependency caching you can provide the --conda-prefix engine option. You may do this in a workflow’s MANIFEST.json by adding the following key/value pair:
"engineOptions": "-j 10 --conda-prefix /mnt/efs/snakemake/conda"