The following pages provide details on the workflow engines that are currently supported by Amazon Genomics CLI.
Workflow Engines
1 - Filesystems
The tasks in a workflow require a common filesystem or scratch space where the outputs of tasks can be written so they are available to the inputs of dependent tasks in the same workflow. The following pages provide details on the engine filesystems that can be deployed by Amazon Genomics CLI.
1.1 - EFS Workflow Filesystem
Amazon EFS Workflow Filesystem
Workflow engines that support it may use Amazon EFS as a shared “scratch” space for hosting workflow intermediates and outputs. Initial inputs are localized once from S3 and final outputs are written back to S3 when the workflow is complete. All intermediate I/O is performed against the EFS filesystem.
Advantages
- Compared with the S3 Filesystem there is no redundant I/O of inputs from S3.
- Each task’s individual I/O operations tend to be smaller than a full copy from S3, so there is less network congestion on the container host.
- Option to use provisioned IOPs to provide high sustained throughput.
- The volume is elastic and will expand and contract as needed.
- It is simple to start an Amazon EC2 instance from the AWS console and connect it to the EFS volume to view outputs as they are created. This can be useful for debugging a workflow.
Disadvantages
- Amazon EFS volumes are more expensive than storing intermediates and output in S3, especially when the volume uses provisioned IOPs.
- The volume exists for the lifetime of the context and will incur costs based on its size for that entire period. If you no longer need the context we recommend destroying it.
- Call caching is only possible for as long as the volume exists, i.e. the lifetime of the context.
Provisioned IOPs
Amazon EFS volumes deployed by the Amazon Genomics CLI use “bursting” throughput by default. For workflows with high I/O throughput, or in scenarios where many workflows run in the same context at the same time, you may exhaust the burst credits of the volume. This might cause a workflow to slow down or even fail. Available volume credits can be monitored in the Amazon EFS console and/or Amazon CloudWatch. If you observe the exhaustion of burst credits you may want to consider deploying a context with provisioned throughput.
The following fragment of an agc-project.yaml file is an example of how to configure provisioned throughput for the Amazon EFS volume used by miniwdl in an Amazon Genomics CLI context:

```yaml
myContext:
  engines:
    - type: wdl
      engine: miniwdl
      filesystem:
        fsType: EFS
        configuration:
          provisionedThroughput: 1024
```
Supporting Engines
The use of Amazon EFS as a shared file system is supported by the miniwdl and Snakemake engines. Both use EFS with bursting throughput by default and both support provisioned IOPs.
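As a sketch, a Snakemake engine could be configured with provisioned throughput in the same way as the miniwdl example above; the context name and throughput value here are illustrative assumptions:

```yaml
myContext:
  engines:
    - type: snakemake
      engine: snakemake
      filesystem:
        fsType: EFS
        configuration:
          provisionedThroughput: 1024
```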
1.2 - S3 Workflow Filesystem
Amazon S3 Workflow Filesystem
Some workflow engines deployed by Amazon Genomics CLI can use S3 as their shared “filesystem”. Because S3 is not a POSIX compliant filesystem and most of the applications run by workflow tasks will require POSIX files, inputs will be localized from Amazon S3 and outputs will be delocalized to Amazon S3.
Advantages
- Inputs are read into each task’s container and are not available via a common container mount, so there is no possibility of containers on the same host over-writing or accessing another task’s inputs.
- No shared file system needs to be provisioned for a context’s compute environment, thereby reducing ongoing costs.
- All intermediate task outputs and all workflow outputs are persisted to the S3 bucket provisioned by Amazon Genomics CLI. This bucket remains after contexts are destroyed, and even after Amazon Genomics CLI is deactivated in the account.
- Container hosts use an auto-expansion strategy for their local EBS volumes so disk sizes don’t need to be stated.
Disadvantages
- Container hosts running multiple tasks may exhaust their aggregate network bandwidth (see below).
- It is assumed that no other external process will be making changes to the S3 objects during a workflow run. If this does happen, the run may fail or be corrupted.
Network Bandwidth Considerations
During workflows with large numbers of concurrent steps that all rely on large inputs you may observe that the localization of inputs to the containers will become very slow. This is because a single EC2 container host may have multiple containers all competing for limited bandwidth. In these cases we recommend the following possible mitigations:
- Consider using a shared filesystem such as EFS for your engine or an engine that supports EFS
- Configure your agc-project.yaml so that a context is available that uses network-optimized instance types. For example, use m5n instance types rather than m5, and use instance types that offer sustained throughput rather than burst throughput, such as instances with more than 16 vCPUs.
- Consider modifying your workflow to request larger memory and vCPU amounts for these tasks. This will tend to make AWS Batch select larger instances with better network performance, as well as place fewer containers per host, resulting in less competition for bandwidth.
These mitigations may result in the use of more expensive infrastructure but can ultimately save money by completing the workflow quicker. The best price-performance configuration will vary by workflow.
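For example, a context restricted to network-optimized instance types might be declared as follows. This is a sketch only; the context name and the specific instance types are illustrative assumptions, not recommendations:

```yaml
contexts:
  networkOptimizedCtx:
    instanceTypes:
      - m5n.8xlarge
      - m5n.16xlarge
    engines:
      - type: nextflow
        engine: nextflow
```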
Supporting Engines
The Cromwell and Nextflow engines both support the use of Amazon S3 as a filesystem. Contexts using these engines will use this filesystem by default.
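As a minimal sketch, a context using Nextflow (and therefore the S3 filesystem by default, with no filesystem block required) could be declared like this; the context name is illustrative:

```yaml
contexts:
  spotCtx:
    requestSpotInstances: true
    engines:
      - type: nextflow
        engine: nextflow
```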
2 - miniwdl
Description
miniwdl is free open source software distributed under the MIT licence developed by the Chan Zuckerberg Initiative.
The source code for miniwdl is available on GitHub. When deployed with Amazon Genomics CLI miniwdl makes use of the miniwdl-aws extension which is also distributed under the MIT licence.
Architecture
There are four components of a miniwdl engine as deployed in an Amazon Genomics CLI context:
WES Adapter
Amazon Genomics CLI communicates with the miniwdl engine via a GA4GH WES REST service. The WES Adapter implements the WES standard and translates WES calls into calls to the miniwdl head process.
Head Compute Environment
For every workflow submitted, the WES adapter will create a new AWS Batch Job that contains the miniwdl process responsible for running that workflow. These miniwdl “head” jobs are run in an “On-demand” AWS Fargate compute environment even when the actual workflow tasks run in a Spot environment. This is to prevent Spot interruptions from terminating the workflow coordinator.
Task Compute Environment
Workflow tasks are submitted by the miniwdl head job to an AWS Batch queue and run in containers using an AWS Compute Environment. Container characteristics are defined by the resources requested in the workflow configuration. AWS Batch coordinates the elastic provisioning of EC2 instances (container hosts) based on the available work in the queue. Batch will place containers on container hosts as space allows.
Session Cache and Input Localization
Any context with a miniwdl engine will use an Amazon Elastic File System (EFS) volume as scratch space. Inputs from S3 are localized to the volume by jobs that the miniwdl engine spawns to copy these files to the volume. Outputs are copied back to S3 using a similar process. Workflow tasks access the EFS volume to obtain inputs and write intermediates and outputs.
The EFS volume is used by all miniwdl engine “head” jobs to store metadata necessary for call caching.
The EFS volume will remain in your account for the lifetime of the context and is destroyed when the context is destroyed. Because the volume will grow as you run more workflows, we recommend destroying the context when done to avoid ongoing EFS charges.
Using miniwdl as a Context Engine
You may declare miniwdl to be the engine for any context’s wdl type engine. For example:

```yaml
contexts:
  onDemandCtx:
    requestSpotInstances: false
    engines:
      - type: wdl
        engine: miniwdl
```
Call Caching
Call caching is enabled by default for miniwdl, and because the metadata is stored in the context’s EFS volume, call caching works across different engine “head” jobs.
To disable call caching you can provide the --no-cache engine option. You may do this in a workflow’s MANIFEST.json by adding the following key/value pair:
"engineOptions": "--no-cache"
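For context, a complete MANIFEST.json using this option might look like the following sketch; the mainWorkflowURL and inputFileURLs values are illustrative assumptions for a hypothetical workflow:

```json
{
  "mainWorkflowURL": "main.wdl",
  "inputFileURLs": [
    "inputs.json"
  ],
  "engineOptions": "--no-cache"
}
```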
3 - Toil
Description
Toil is a workflow engine developed by the Computational Genomics Lab at the UC Santa Cruz Genomics Institute. In Amazon Genomics CLI, Toil can be deployed in a context as an engine to run workflows written in the Common Workflow Language (CWL) standard, versions v1.0, v1.1, and v1.2 (or mixed versions).
Toil is an open source project distributed by UC Santa Cruz under the Apache 2 license and available on GitHub.
Architecture
There are two components of a Toil engine as deployed in an Amazon Genomics CLI context:
Toil Server
The Toil engine is run in “server mode” as a container service in ECS. The engine can run multiple workflows asynchronously. Workflow tasks are run in an elastic compute environment and monitored by Toil. Amazon Genomics CLI communicates with the Toil engine via a GA4GH WES REST service which the server offers, available via API Gateway.
Task Compute Environment
Workflow tasks are submitted by Toil to an AWS Batch queue and run in Toil-provided containers using an AWS Compute Environment. Tasks which use the CWL DockerRequirement will additionally be run in sibling containers on the host Docker daemon. AWS Batch coordinates the elastic provisioning of EC2 instances (container hosts) based on the available work in the queue. Batch will place containers on container hosts as space allows.
Disk Expansion
Container hosts in the Batch compute environment use EBS volumes as local scratch space. As an EBS volume approaches a capacity threshold, new EBS volumes will be attached and merged into the file system. These volumes are destroyed when AWS Batch terminates the container host. CWL disk space requirements are ignored by Toil when running against AWS Batch.
This setup means that workflows that succeed on AGC may fail on other CWL runners (because they do not request enough disk space) and workflows that succeed on other CWL runners may fail on AGC (because they allocate disk space faster than the expansion process can react).
4 - Cromwell
Description
Cromwell is a workflow engine developed by the Broad Institute. In Amazon Genomics CLI, Cromwell can be deployed in a context as an engine to run workflows based on the WDL specification.
Cromwell is an open source project distributed by the Broad Institute under the Apache 2 license and available on GitHub.
Customizations
Some minor customizations were made to the AWS backend adapter for Cromwell to facilitate improved scalability and cross-region S3 bucket access when deployed with Amazon Genomics CLI. The fork containing these customizations is available here, and we are working to contribute them back to the main code base.
Architecture
There are four components of a Cromwell engine as deployed in an Amazon Genomics CLI context.
WES Adapter
Amazon Genomics CLI communicates with the Cromwell engine via a GA4GH WES REST service. The WES Adapter implements the WES standard and translates WES calls into calls to the Cromwell REST API. The adapter runs as an Amazon ECS service available via API Gateway.
Cromwell Server
The Cromwell engine is run in “server mode” as a container service in ECS and receives instructions from the WES Adapter. The engine can run multiple workflows asynchronously. Workflow tasks are run in an elastic compute environment and monitored by Cromwell.
Session Cache
Cromwell can use workflow run metadata to perform call caching. When deployed by Amazon Genomics CLI, call caching is enabled by default. Metadata is stored by an embedded HSQL DB with file storage in an attached EFS volume. The EFS volume exists for the lifetime of the context the engine is deployed in, so re-runs of workflows within that lifetime can benefit from call caching.
Task Compute Environment
Workflow tasks are submitted by Cromwell to an AWS Batch queue and run in containers using an AWS Compute Environment.
Container characteristics are defined by the task runtime attributes. AWS Batch coordinates the elastic provisioning of EC2 instances (container hosts) based on the available work in the queue. Batch will place containers on container hosts as space allows.
Fetch and Run Strategy
Execution of workflow tasks uses a “Fetch and Run” strategy. The commands specified in the command section of the WDL task are written as a file to S3, “fetched” into the container, and run. The script is “decorated” with instructions to fetch any File inputs from S3 and to write any File outputs back to S3.
Disk Expansion
Container hosts in the Batch compute environment use EBS volumes as local scratch space. As an EBS volume approaches a capacity threshold, new EBS volumes will be attached and merged into the file system. These volumes are destroyed when AWS Batch terminates the container host. For this reason it is not necessary to specify disk requirements in the task runtime, and these WDL directives will be ignored.
AWS Batch Retries
The Cromwell AWS Batch backend supports AWS Batch’s task retry option, allowing failed tasks to attempt to run again. This can be useful for adding resilience to a workflow against sporadic infrastructure failures. It is especially useful when using an Amazon Genomics CLI “spot” context, as Spot instances can be terminated with minimal warning. To enable retries, add the following option to the runtime section of a task:

```wdl
runtime {
  ...
  awsBatchRetryAttempts: <int>
  ...
}
```

where <int> is an integer specifying the number of retries, up to a maximum of 10.
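As an illustrative sketch, a hypothetical task using this option might look like the following; the task name, command, and container image are assumptions, not part of the Cromwell documentation:

```wdl
task count_lines {
  input {
    File infile
  }
  command {
    # count lines in the input file; retried by AWS Batch on failure
    wc -l < ~{infile} > line_count.txt
  }
  runtime {
    docker: "ubuntu:20.04"
    awsBatchRetryAttempts: 3
  }
  output {
    Int count = read_int("line_count.txt")
  }
}
```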
Although similar to the WDL preemptible option, awsBatchRetryAttempts differs in how retries are implemented. Notably, the implementation falls back on the AWS Batch retry strategy and will retry a task that fails for any reason, whereas the preemptible option is specific to failures caused by preemption. At this time the preemptible option is not supported by Amazon Genomics CLI and is ignored.
5 - Nextflow
Description
Nextflow is free open source software distributed under the Apache 2.0 licence developed by Seqera Labs. The project was started in the Notredame Lab at the Centre for Genomic Regulation (CRG).
The source code for Nextflow is available on GitHub.
Architecture
There are four components of a Nextflow engine as deployed in an Amazon Genomics CLI context:
WES Adapter
Amazon Genomics CLI communicates with the Nextflow engine via a GA4GH WES REST service. The WES Adapter implements the WES standard and translates WES calls into calls to the Nextflow head process.
Head Compute Environment
For every workflow submitted, the WES adapter will create a new AWS Batch Job that contains the Nextflow process responsible for running that workflow. These Nextflow “head” jobs are run in an “On-demand” compute environment even when the actual workflow tasks run in a Spot environment. This is to prevent Spot interruptions from terminating the workflow coordinator.
Task Compute Environment
Workflow tasks are submitted by the Nextflow head job to an AWS Batch queue and run in containers using an AWS Compute Environment. Container characteristics are defined by the resources requested in the workflow configuration. AWS Batch coordinates the elastic provisioning of EC2 instances (container hosts) based on the available work in the queue. Batch will place containers on container hosts as space allows.
Fetch and Run Strategy
Execution of workflow tasks uses a “Fetch and Run” strategy. Input files required by a workflow task are fetched from S3 into the task container. Output files are copied out of the container to S3.
Disk Expansion
Container hosts in the Batch compute environment use EBS volumes as local scratch space. As an EBS volume approaches a capacity threshold, new EBS volumes will be attached and merged into the file system. These volumes are destroyed when AWS Batch terminates the container host.
6 - Snakemake
Description
Snakemake is free open source software distributed under the MIT licence developed by Johannes Köster and their team.
The source code for Snakemake is available on GitHub. When deployed with Amazon Genomics CLI, Snakemake uses AWS Batch to distribute the underlying tasks.
Architecture
There are four components of a snakemake engine as deployed in an Amazon Genomics CLI context:
WES Adapter
Amazon Genomics CLI communicates with the snakemake engine via a GA4GH WES REST service. The WES Adapter implements the WES standard and translates WES calls into calls to the snakemake head process.
Head Compute Environment
For every workflow submitted, the WES adapter will create a new AWS Batch Job that contains the snakemake process responsible for running that workflow. These snakemake “head” jobs are run in an “On-demand” AWS Fargate compute environment even when the actual workflow tasks run in a Spot environment. This is to prevent Spot interruptions from terminating the workflow coordinator.
Task Compute Environment
Workflow tasks are submitted by the snakemake head job to an AWS Batch queue and run in containers using an AWS Compute Environment. Container characteristics are defined by the resources requested in the workflow configuration. AWS Batch coordinates the elastic provisioning of EC2 instances (container hosts) based on the available work in the queue. Batch will place containers on container hosts as space allows.
Session Cache and Input Localization
Any context with a snakemake engine will use an Amazon Elastic File System (EFS) volume as scratch space. Inputs from the workflow are localized to the volume by jobs that the snakemake engine spawns to copy these files to the volume. Outputs are copied back to S3 after the workflow is complete. Workflow tasks access the EFS volume to obtain inputs and write intermediates and outputs.
The EFS volume can be used by all Snakemake engine “head” jobs to store metadata necessary for dependency caching by specifying a Conda workspace argument that is common across all executions, for example --conda-prefix /mnt/efs/snakemake/conda.
The EFS volume will remain in your account for the lifetime of the context and is destroyed when the context is destroyed. Because the volume will grow as you run more workflows, we recommend destroying the context when done to avoid ongoing EFS charges.
Using Snakemake as a Context Engine
You may declare snakemake to be the engine for any context’s snakemake type engine. For example:

```yaml
contexts:
  onDemandCtx:
    requestSpotInstances: false
    engines:
      - type: snakemake
        engine: snakemake
```
Conda Dependency Caching
Dependency caching is disabled by default so that each workflow run is independent. If you would like workflow runs to re-use the Conda cache, specify a folder under /mnt/efs, which is where the EFS storage space is attached. This enables Snakemake to re-use dependencies, which will decrease the time that subsequent workflow runs take.
To enable dependency caching you can provide the --conda-prefix engine option. You may do this in a workflow’s MANIFEST.json by adding the following key/value pair:
"engineOptions": "-j 10 --conda-prefix /mnt/efs/snakemake/conda"