Filesystems

Workflow Filesystems

The tasks in a workflow require a common filesystem or scratch space where the outputs of tasks can be written so they are available to the inputs of dependent tasks in the same workflow. The following pages provide details on the engine filesystems that can be deployed by Amazon Genomics CLI.

1 - EFS Workflow Filesystem

Amazon EFS Workflow Filesystem

Workflow engines that support it may use Amazon EFS as a shared “scratch” space for hosting workflow intermediates and outputs. Initial inputs are localized once from S3 and final outputs are written back to S3 when the workflow is complete. All intermediate I/O is performed against the EFS filesystem.

Advantages

  1. Compared with the S3 filesystem, there is no redundant I/O of inputs from S3.
  2. Each task's individual I/O operations tend to be smaller than a full copy from S3, so there is less network congestion on the container host.
  3. Provisioned throughput can be used to provide high sustained throughput.
  4. The volume is elastic and will expand and contract as needed.
  5. It is simple to start an Amazon EC2 instance from the AWS console and connect it to the EFS volume to view outputs as they are created. This can be useful for debugging a workflow.

Disadvantages

  1. Amazon EFS volumes are more expensive than storing intermediates and outputs in S3, especially when the volume uses provisioned throughput.
  2. The volume exists for the lifetime of the context and will incur costs based on its size for as long as the context exists. If you no longer need the context, we recommend destroying it.
  3. Call caching is only possible for as long as the volume exists, i.e. the lifetime of the context.

Provisioned Throughput

Amazon EFS volumes deployed by the Amazon Genomics CLI use “bursting” throughput by default. For workflows with high I/O demands, or when many workflows run in the same context at the same time, you may exhaust the volume's burst credits. This can cause a workflow to slow down or even fail. Available burst credits can be monitored in the Amazon EFS console and/or Amazon CloudWatch. If you observe the exhaustion of burst credits, consider deploying a context with provisioned throughput.

The following fragment of an agc-project.yaml file is an example of how to configure provisioned throughput for the Amazon EFS volume used by miniwdl in an Amazon Genomics CLI context.

myContext:
    engines:
      - type: wdl
        engine: miniwdl
        filesystem:
          fsType: EFS
          configuration:
            provisionedThroughput: 1024 # throughput in MiB/s

Supporting Engines

The use of Amazon EFS as a shared filesystem is supported by the miniwdl and Snakemake engines. Both use EFS with bursting throughput by default, and both support provisioned throughput.
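
For example, the following agc-project.yaml fragment is a sketch of a Snakemake engine using Amazon EFS with the default bursting throughput (the context name is illustrative, and it assumes the same filesystem keys shown above apply; omitting the configuration block keeps the default):

mySnakemakeContext:
    engines:
      - type: snakemake
        engine: snakemake
        filesystem:
          fsType: EFS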

2 - S3 Workflow Filesystem

Amazon S3 Workflow Filesystem

Some workflow engines deployed by Amazon Genomics CLI can use S3 as their shared “filesystem”. Because S3 is not a POSIX-compliant filesystem and most applications run by workflow tasks require POSIX files, inputs are localized from Amazon S3 and outputs are delocalized back to Amazon S3.

Advantages

  1. Inputs are read into each task's container and are not available via a common container mount, so there is no possibility of containers on the same host over-writing or accessing another task's inputs.
  2. No shared filesystem needs to be provisioned for a context's compute environment, thereby reducing ongoing costs.
  3. All intermediate task outputs and all workflow outputs are persisted to the S3 bucket provisioned by Amazon Genomics CLI. This bucket remains after contexts are destroyed, and even after Amazon Genomics CLI is deactivated in the account.
  4. Container hosts use an auto-expansion strategy for their local EBS volumes, so disk sizes do not need to be specified.

Disadvantages

  1. Container hosts running multiple tasks may exhaust their aggregate network bandwidth (see below).
  2. It is assumed that no other external process will be making changes to the S3 objects during a workflow run. If this does happen, the run may fail or be corrupted.

Network Bandwidth Considerations

In workflows with large numbers of concurrent tasks that all rely on large inputs, you may observe that localization of inputs to the containers becomes very slow. This is because a single EC2 container host may run multiple containers, all competing for limited network bandwidth. In these cases we recommend the following mitigations:

  1. Consider using a shared filesystem such as EFS with your engine, or switching to an engine that supports EFS.
  2. Configure your agc-project.yaml so that a context is available that uses network-optimized instance types. For example, use m5n instance types rather than m5, and prefer instance sizes that offer sustained network throughput rather than burst throughput, such as instances with more than 16 vCPUs (see the sketch after this list).
  3. Consider modifying your workflow to request larger memory and vCPU amounts for these tasks. This tends to ensure AWS Batch selects larger instances with better network performance, and places fewer containers per host, resulting in less competition for bandwidth.
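
The following agc-project.yaml fragment is a sketch of the second mitigation (the context name and instance types are illustrative; choose sizes appropriate to your workload). It restricts the context's compute environment to network-optimized instance types via the instanceTypes property:

myNetworkOptimizedContext:
    instanceTypes:
      - m5n.8xlarge
      - m5n.12xlarge
    engines:
      - type: wdl
        engine: cromwell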

These mitigations may result in the use of more expensive infrastructure but can ultimately save money by completing the workflow more quickly. The best price-performance configuration will vary by workflow.

Supporting Engines

The Cromwell and Nextflow engines both support the use of Amazon S3 as a filesystem. Contexts using these engines will use this filesystem by default.
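
For example, the following agc-project.yaml fragment is a sketch of a context with a Nextflow engine (the context name is illustrative). No filesystem block is specified, so the default S3 filesystem is used:

myNextflowContext:
    engines:
      - type: nextflow
        engine: nextflow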