Best Practices

Best practices when using Amazon Genomics CLI

Things to consider so that you can get the most out of Amazon Genomics CLI

1 - IAM Permissions

Minimum IAM Permissions required to use AGC

Amazon Genomics CLI is used to deploy and interact with infrastructure in an AWS account. It uses the permissions of the current profile to perform its actions. This will either be the user's profile or, if Amazon Genomics CLI is run from an EC2 instance, the instance's attached profile. Whatever the source of the role, it must have sufficient permissions to perform the necessary tasks. In addition, best practice recommends that the profile grant only minimal permissions, to maintain security and prevent unintended actions.

As part of the Amazon Genomics CLI repository we have included a CDK project that can be used by an account administrator to generate the necessary minimum policies.

Pre-requisites

Before generating the policies you need to do the following:

  1. Install node and npm. We recommend node v14.17 installed via nvm (example commands are shown after this list)
  2. Install the AWS CDK (npm install -g aws-cdk@latest)
  3. Have an AWS account where you will use Amazon Genomics CLI
  4. Have a role in that account that allows the creation of IAM roles and policies
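As a minimal sketch, assuming you use nvm to manage node versions, the first two prerequisites can be installed and your current AWS credentials verified with commands similar to the following:

# install node v14.17 via nvm and the AWS CDK CLI
nvm install 14.17
nvm use 14.17
npm install -g aws-cdk@latest

# confirm which AWS account and role your current profile resolves to
aws sts get-caller-identity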

Generate Roles and Policies

  1. Clone the Amazon Genomics CLI repository locally: git clone https://github.com/aws/amazon-genomics-cli.git
  2. cd amazon-genomics-cli/extras/agc-minimal-permissions/
  3. npm install
  4. cdk deploy

You will see output similar to the following:

✨  Synthesis time: 2.91s

AgcPermissionsStack: deploying...
AgcPermissionsStack: creating CloudFormation changeset...

 ✅  AgcPermissionsStack

✨  Deployment time: 44.39s

Stack ARN:
arn:aws:cloudformation:us-east-1:123456789123:stack/AgcPermissionsStack/6ace55f0-b67c-11ec-a5d3-0a1e6da159c9

✨  Total time: 47.3s

Using the emitted Stack ARN you can identify the policies created. You can also inspect the stack in the CloudFormation console.

For example:

aws cloudformation describe-stack-resources --stack-name <stack arn>

with output similar to:

{
    "StackResources": [
        {
            "StackName": "AgcPermissionsStack",
            "StackId": "arn:aws:cloudformation:us-east-1:123456789123:stack/AgcPermissionsStack/6ace55f0-b67c-11ec-a5d3-0a1e6da159c9",
            "LogicalResourceId": "CDKMetadata",
            "PhysicalResourceId": "6ace55f0-b67c-11ec-a5d3-0a1e6da159c9",
            "ResourceType": "AWS::CDK::Metadata",
            "Timestamp": "2022-04-07T14:10:30.922000+00:00",
            "ResourceStatus": "CREATE_COMPLETE",
            "DriftInformation": {
                "StackResourceDriftStatus": "NOT_CHECKED"
            }
        },
        {
            "StackName": "AgcPermissionsStack",
            "StackId": "arn:aws:cloudformation:us-east-1:123456789123:stack/AgcPermissionsStack/6ace55f0-b67c-11ec-a5d3-0a1e6da159c9",
            "LogicalResourceId": "agcadminpolicy25003180",
            "PhysicalResourceId": "arn:aws:iam::123456789123:policy/AgcPermissionsStack-agcadminpolicy25003180-1ST0KJ0I5J45R",
            "ResourceType": "AWS::IAM::ManagedPolicy",
            "Timestamp": "2022-04-07T14:10:41.597000+00:00",
            "ResourceStatus": "CREATE_COMPLETE",
            "DriftInformation": {
                "StackResourceDriftStatus": "NOT_CHECKED"
            }
        },
        {
            "StackName": "AgcPermissionsStack",
            "StackId": "arn:aws:cloudformation:us-east-1:123456789123:stack/AgcPermissionsStack/6ace55f0-b67c-11ec-a5d3-0a1e6da159c9",
            "LogicalResourceId": "agcuserpolicy346A2D4F",
            "PhysicalResourceId": "arn:aws:iam::123456789123:policy/AgcPermissionsStack-agcuserpolicy346A2D4F-1X9U4HCQ8Z19U",
            "ResourceType": "AWS::IAM::ManagedPolicy",
            "Timestamp": "2022-04-07T14:10:41.981000+00:00",
            "ResourceStatus": "CREATE_COMPLETE",
            "DriftInformation": {
                "StackResourceDriftStatus": "NOT_CHECKED"
            }
        },
        {
            "StackName": "AgcPermissionsStack",
            "StackId": "arn:aws:cloudformation:us-east-1:123456789123:stack/AgcPermissionsStack/6ace55f0-b67c-11ec-a5d3-0a1e6da159c9",
            "LogicalResourceId": "agcuserpolicycdk27FA61BC",
            "PhysicalResourceId": "arn:aws:iam::123456789123:policy/AgcPermissionsStack-agcuserpolicycdk27FA61BC-OXS49AINGPIG",
            "ResourceType": "AWS::IAM::ManagedPolicy",
            "Timestamp": "2022-04-07T14:10:41.747000+00:00",
            "ResourceStatus": "CREATE_COMPLETE",
            "DriftInformation": {
                "StackResourceDriftStatus": "NOT_CHECKED"
            }
        }
    ]
}

Three resources of type AWS::IAM::ManagedPolicy are created:

  • The resource with a name similar to agcadminpolicy25003180 identifies a policy that grants sufficient permissions to run agc account activate and agc account deactivate. It should be attached to the profiles of users who need to perform those actions.
  • The two resources with names similar to agcuserpolicy346A2D4F and agcuserpolicycdk27FA61BC identify policies that allow all other Amazon Genomics CLI actions. These should be attached to the profiles that will use Amazon Genomics CLI day to day (see the example after this list).
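As a sketch, the policy ARNs can be listed from the stack and attached to an IAM user or role with the AWS CLI. The names agc-user and agc-instance-role below are hypothetical examples:

# list the ARNs of the managed policies created by the stack
aws cloudformation describe-stack-resources \
    --stack-name <stack arn> \
    --query "StackResources[?ResourceType=='AWS::IAM::ManagedPolicy'].PhysicalResourceId" \
    --output text

# attach a user policy to an IAM user
aws iam attach-user-policy \
    --user-name agc-user \
    --policy-arn arn:aws:iam::123456789123:policy/AgcPermissionsStack-agcuserpolicy346A2D4F-1X9U4HCQ8Z19U

# or attach it to a role used by an EC2 instance profile
aws iam attach-role-policy \
    --role-name agc-instance-role \
    --policy-arn arn:aws:iam::123456789123:policy/AgcPermissionsStack-agcuserpolicy346A2D4F-1X9U4HCQ8Z19U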

2 - Controlling Costs

Monitoring costs and design considerations to reduce costs

When you begin to run large-scale workflows frequently, it becomes important to understand the costs involved and how to optimize your workflows and your use of Amazon Genomics CLI to reduce them.

Use AWS Cost Explorer to Report on Costs

AWS Cost Explorer has an easy-to-use interface that lets you visualize, understand, and manage your AWS costs and usage over time. We recommend using this tool to gain insight into the costs of running your genomics workflows. At the time of writing, AWS Cost Explorer can only be enabled using the AWS Console, so Amazon Genomics CLI cannot set this up for you. As a first step you will need to enable Cost Explorer for your AWS account.

Amazon Genomics CLI tags the infrastructure it creates. Application, user, project and context tags are generated as appropriate, and these can be used as cost allocation tags to determine which costs in an account come from Amazon Genomics CLI and from which user, context and project.

Within Cost Explorer the Amazon Genomics CLI tags will be referred to as “User Defined Cost Allocation Tags”. Before a tag can be used in a cost report it must be activated. Costs associated with tags are only available for infrastructure used after activation of a tag, so it will not be possible to retrospectively examine costs.
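For example, once a tag has been activated you can group costs by it from the command line. The tag key agc-project below is illustrative; replace it with the tag key as it appears in your account:

aws ce get-cost-and-usage \
    --time-period Start=2022-04-01,End=2022-05-01 \
    --granularity MONTHLY \
    --metrics UnblendedCost \
    --group-by Type=TAG,Key=agc-project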

Optimizing Requested Container Resources

Tasks in a workflow typically run in Docker containers. Depending on the workflow language there will be some kind of runtime definition that specifies the number of vCPUs and amount of RAM allocated to the task. For example, in WDL you could specify

  runtime {
    docker: "biocontainers/plink1.9:v1.90b6.6-181012-1-deb_cv1"
    memory: "8 GB"
    cpu: 2
  }

The amount of resource allocated to each container ultimately impacts the cost to run a workflow. Optimally allocating resources leads to cost efficiency.

Optimize the longest running and most parallel tasks first

When optimizing a workflow, focus on those tasks that run the longest as well as those that have the largest number of parallel tasks as they will make up the majority of the workflow runtime and contribute most to the cost.

Consider CPU and memory ratios

EC2 workers for Cromwell AWS Batch compute environments come from the c, m, and r instance families, which have vCPU to memory ratios of 1:2, 1:4 and 1:8 respectively. Engines that run container-based workflows will typically attempt to fit containers to instances in the most optimal way depending on cost and size requirements, or they will delegate this to a service like AWS Batch. For a task requiring 16 GB of RAM that can make use of all available CPUs, optimal packing means specifying 2, 4, or 8 vCPU. Other values can lead to inefficient packing, meaning the resources of the EC2 container instance will be paid for but not fully used.
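For example, a task needing 16 GB of RAM packs cleanly onto the m family (1:4 ratio) when it requests 4 vCPU. A minimal WDL sketch, with an illustrative image:

  runtime {
    docker: "ubuntu:20.04"
    memory: "16 GB"
    cpu: 4      # 4 vCPU : 16 GB is a 1:4 ratio, matching the m instance family
  }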

NOTE: Fully packing an instance can result in it becoming unresponsive if the tasks in the containers use 100% (or more if they start swapping) of the allocated resources. The instance may then be unresponsive to its management services or the workflow engine and may time out. To avoid this, always allow for a little overhead, especially in the smaller instances.

The largest instance types deployed by default are of the 4xlarge size, which has 16 vCPU and up to 128 GB of RAM.

Consider splitting tasks that pipe output

If a workflow task consists of a process that pipes STDOUT to another process, both processes will run in the same container and receive the same resources. If one process requires more resources than the other this can be inefficient, and the task may be better divided into two tasks, each with its own runtime configuration. Note that this requires the intermediate STDOUT to be written to a file and copied between containers, so if this output is very large, keeping the processes in the same task may be more efficient. Piping very large outputs can use a lot of memory, so your container will need an appropriate memory allocation.
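As a hedged WDL illustration, a task that pipes alignment output into sorting could be split so each step declares only the resources it needs. The images, commands and resource figures are placeholders, and task inputs are omitted for brevity:

# alignment is the memory-hungry step, so it gets its own runtime allocation
task align {
  command <<< bwa mem ref.fa reads.fq > aligned.sam >>>
  runtime {
    docker: "biocontainers/bwa:v0.7.17_cv1"
    memory: "16 GB"
    cpu: 8
  }
}

# sorting needs far less memory, so it no longer inflates the alignment container
task sort {
  input {
    File aligned_sam
  }
  command <<< samtools sort -o sorted.bam ~{aligned_sam} >>>
  runtime {
    docker: "biocontainers/samtools:v1.9-4-deb_cv1"
    memory: "4 GB"
    cpu: 2
  }
}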

Use the most cost-effective instance generation

When you specify the instanceTypes in a context, as opposed to letting Amazon Genomics CLI do it for you, consider the cost and performance of the instance types with respect to your workflow requirements. Fifth-generation EC2 types (c5, m5, r5) have a lower on-demand price and higher clock speeds than their fourth-generation counterparts (c4, m4, r4), so for on-demand compute environments those instance types should be preferred. In Spot compute environments we suggest using both fourth- and fifth-generation types, as this increases the pool of available types and means Batch will be able to choose the instance type that is cheapest and least likely to be interrupted.
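For example, a Spot context offering Batch both generations might look like the following in the project file. The context name and the specific sizes are illustrative:

contexts:
    spotCtx:
        requestSpotInstances: true
        instanceTypes:
            - c4.4xlarge
            - c5.4xlarge
            - m4.4xlarge
            - m5.4xlarge
            - r4.4xlarge
            - r5.4xlarge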

Deploy Amazon Genomics CLI where your S3 data is

Genomics workflows may need to access considerable amounts of data stored in S3. Although S3 uses global namespaces, buckets do reside in regions. If you access a lot of S3 data it makes sense to deploy your Amazon Genomics CLI infrastructure in the same region to avoid cross-region data charges.

Further, if you use a custom VPC we recommend deploying a VPC endpoint for S3 so that you do not incur NAT Gateway charges for data coming from the same region. If you do not, you might find that NAT Gateway charges are the largest part of your workflow run costs. If you allow Amazon Genomics CLI to create your VPC (the default), appropriate VPC endpoints will be set up for you. Note that VPC endpoints cannot avoid cross-region data charges, so you will still want to deploy in the region where most of your data resides.
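If you do bring your own VPC, an S3 gateway endpoint can be added with a command along these lines; the VPC ID, route table ID and region are placeholders:

aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Gateway \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0123456789abcdef0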

Use Spot Instances

The use of Spot instances can significantly reduce the cost of running workflows. However, Spot instances may be interrupted when EC2 demand is high. Some workflow engines, such as Cromwell, can retry tasks that fail due to Spot interruption (among other things). To enable this for Cromwell, include the awsBatchRetryAttempts parameter in the runtime section of a WDL task with an integer number of attempts.
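For example, building on the earlier runtime block (the image and resource values are illustrative), a Cromwell task could allow up to three retries:

  runtime {
    docker: "biocontainers/plink1.9:v1.90b6.6-181012-1-deb_cv1"
    memory: "8 GB"
    cpu: 2
    awsBatchRetryAttempts: 3    # resubmit the task up to 3 times if it fails, e.g. due to Spot interruption
  }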

Even with retries, there is a risk that Spot interruption will cause a task or an entire workflow to fail. Using an engine's call caching capabilities (if available) can help avoid repeating work if a partially complete workflow needs to be restarted due to Spot instance interruption.

Use private ECR registries

Each task in a workflow requires access to a container image, and some of these images can be several GB if they contain large packages like GATK. This can lead to large NAT Gateway traffic charges. To avoid these charges, we recommend deploying copies of frequently used container images into your account's private ECR registry.
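A hedged sketch of copying a public image into a private ECR repository; the account ID, region, and repository name are placeholders, and the image is the one used in the earlier example:

# authenticate Docker to your private registry
aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin 123456789123.dkr.ecr.us-east-1.amazonaws.com

# create a repository and push a copy of the image
aws ecr create-repository --repository-name plink1.9
docker pull biocontainers/plink1.9:v1.90b6.6-181012-1-deb_cv1
docker tag biocontainers/plink1.9:v1.90b6.6-181012-1-deb_cv1 \
    123456789123.dkr.ecr.us-east-1.amazonaws.com/plink1.9:v1.90b6.6-181012-1-deb_cv1
docker push 123456789123.dkr.ecr.us-east-1.amazonaws.com/plink1.9:v1.90b6.6-181012-1-deb_cv1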

VPCs deployed by Amazon Genomics CLI use VPC endpoints to talk to private ECR registries in your account, thereby avoiding NAT Gateway traffic. These endpoints are limited to registries in the same region as the VPC, so to avoid cross-region traffic you should deploy images into the region(s) that you use for Amazon Genomics CLI.

3 - Scaling Workloads

Making workflows run at scale

Workflows with considerable compute requirements can incur large costs and may fail due to infrastructure constraints. The following considerations will help you design workflows that will perform better at scale.

Large compute requirements

By default, contexts created by AGC will allocate compute nodes with a size of up to 4xlarge. These types have 16 vCPU and up to 128 GB of RAM. If an individual task requires additional resources you may specify these in the instanceTypes array of the project context. For example:

contexts:
    prod:
        requestSpotInstances: false
        instanceTypes:
            - c5.16xlarge
            - r5.24xlarge

Large data growth

When using the Nextflow or Cromwell engines the EC2 container instances that carry out the work use a script to detect and automatically expand disk capacity. Generally, this will allow disk space to increase to the amount required to hold inputs, scratch and outputs. However, it can take up to a minute to attach new storage so events that fill disk space in under a minute can result in failure.

Large numbers of inputs/outputs

Typically, genomics files are large and best stored in S3. However, most applications used in genomics workflows cannot read directly from S3. Therefore, these inputs must be localized from S3. Compute work will not be able to begin until localization is complete so “divide and conquer” strategies are useful in these cases.

Whenever possible compress inputs (and outputs) appropriately. The CPU overhead of compression will be low compared to the network overhead of localization and delocalization.
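As a small illustration, a FASTQ input could be compressed before upload to S3; the file and bucket names are placeholders:

# compress the input before upload; many genomics tools read gzipped FASTQ directly
gzip reads.fastq
aws s3 cp reads.fastq.gz s3://my-input-bucket/inputs/reads.fastq.gz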

Localization of large numbers of large files from S3 will put load on the network interface of the worker nodes and the node may experience transient network failures or S3 throttling. While we have included retry-with-backoff logic for localization it is not impossible that downloads may occasionally fail. Failures (and retries) will be recorded in the workflow task logs.

Parallel Steps

Workflows often contain parallel steps where many individual tasks are computed in parallel. Amazon Genomics CLI makes use of elastic compute clusters to scale to these requirements. Each context will deploy an elastic compute cluster with a minimum of 0 vCPU and a maximum of 256 vCPU. No individual task may use more than 256 vCPU. Smaller tasks may be run in parallel up to the maximum of 256 vCPU. Once that limit is met, additional tasks will be queued to run when capacity becomes free.

Each parallel task is isolated, meaning each task needs a local copy of its inputs. When large numbers of parallel tasks require the same inputs (for example reference genomes) you may observe contention for network resources and transient S3 failures. While we have included retry-with-backoff logic, we recommend keeping the number of parallel tasks requiring the same inputs below 500, and fewer if the tasks' inputs are large.

An extreme example is Joint Genotyping. This type of analysis benefits from processing large numbers of samples at the same time, and the user may also wish to genotype many intervals concurrently. Finally, the step that merges the variant calls will import the variants from all intervals. In our experience, a naive implementation calling 100 samples over 100 intervals is feasible, as is calling ~20 samples over 500 intervals. At larger scales it is worth considering dividing tasks by chromosome or batching inputs.

Container throttling

Some container registries will throttle container access from anonymous accounts. Because each task in a workflow uses a container, large or frequently run workflows may not be able to access their required containers. While compute clusters deployed by Amazon Genomics CLI are configured to cache containers, the cache is only available on a per-instance basis, and due to the elastic nature of the clusters, instances with cached container images are frequently shut down. All of this can lead to an excess of requests. To avoid this we recommend using registries that don't impose these limits, or using images hosted in an ECR registry in your AWS account.
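For example, after copying an image into ECR as described above, a task's runtime section simply references the private URI instead of the public one; the account ID and region are placeholders:

  runtime {
    docker: "123456789123.dkr.ecr.us-east-1.amazonaws.com/plink1.9:v1.90b6.6-181012-1-deb_cv1"
    memory: "8 GB"
    cpu: 2
  }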