Concepts

What do you need to know about Amazon Genomics CLI in order to use it - or potentially contribute to it?

For a general introduction to Amazon Genomics CLI, refer to the Overview.

Amazon Genomics CLI uses a handful of core concepts to abstract the deployment of infrastructure needed to run workflows and to organize workflows and dependencies. Gaining an understanding of the concepts below will help you understand how Amazon Genomics CLI works and how it is organized.

1 - Accounts

How Amazon Genomics CLI interacts with AWS Accounts

Amazon Genomics CLI requires an AWS account in which to deploy the cloud infrastructure required to run and manage workflows. To begin working with Amazon Genomics CLI, an account must be “Activated” by the Amazon Genomics CLI application using the account activate command.

Which AWS Account is Used by Amazon Genomics CLI?

Amazon Genomics CLI uses the same AWS credential chain used by the AWS CLI to determine what account should be used and with what credentials. All that is required is that you have an existing AWS account (or create a new one) which contains at least one IAM Principal (User/Role) that you can access.

Which Region is Used by Amazon Genomics CLI?

Much like accounts and credentials, Amazon Genomics CLI uses the same chain used by the AWS CLI to determine the region that is being targeted. For example, if your AWS profile uses us-east-1 then Amazon Genomics CLI will use the same. Likewise, if you set the AWS_REGION environment variable to eu-west-1 then that region will be used by Amazon Genomics CLI for all subsequent commands in that shell.
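For example (a sketch using the standard AWS CLI environment variables; the profile name is a placeholder):

export AWS_PROFILE=my-genomics-profile   # use the credentials and default region of this profile
export AWS_REGION=eu-west-1              # explicitly override the region for subsequent commands
agc account activate                     # activates Amazon Genomics CLI in eu-west-1 using that profile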

Shared Infrastructure

When a region is first activated for Amazon Genomics CLI, some basic infrastructure is deployed including a VPC, which is used for the compute infrastructure that will be deployed in a context, and an S3 bucket which will be used to store workflow intermediates and results. This core infrastructure will be shared by all Amazon Genomics CLI users and projects in that region.

The following diagram shows the infrastructure deployed when the command agc account activate is run:

Image of shared infrastructure

Note that context specific infrastructure is not shared and is unique and namespaced by user and project.

Bring your Own VPC and S3 Bucket

During account activation you may specify an existing VPC ID or S3 bucket name for use by Amazon Genomics CLI. If you do not, these will be created for you. Although these resources are created following AWS best practices, if your organization has specific security requirements for networking and storage, providing your own may be the easiest way to activate Amazon Genomics CLI in your environment.

Account Commands

A full reference of the account commands is here

activate

You can activate an account using agc account activate. An account must be activated before any contexts can be deployed or workflows run.

Activating an account will also bootstrap the AWS Environment for CDK app deployments.

Using an Existing S3 Bucket

Amazon Genomics CLI requires an S3 bucket to store workflow results and associated information. If you prefer to use an existing bucket you can use the form agc account activate --bucket my-existing-bucket. If you do this the AWS IAM role used to run Amazon Genomics CLI must be able to write to that bucket.

Using an Existing VPC

To use an existing VPC you can use the form agc account activate --vpc my-existing-vpc-id. This VPC must have at least 3 availability zones each with at least one private subnet. The private subnets must have connectivity to the internet, such as via a NAT gateway, and connectivity to AWS services either through VPC endpoints or the internet. Amazon Genomics CLI will not modify the network topology of the specified VPC.

Specifying Subnets

When using an existing VPC you may need to specify which subnets of the VPC can be used for infrastructure. This is useful when only some private subnets have internet routing. To do this you can supply a comma separated list of subnet IDs using the --subnets flag, or repeat the flag multiple times. For example:

agc account activate --vpc my-existing-vpc-id --subnets subnet-id-1,subnet-id-2 --subnets subnet-id-3

We recommend a minimum of 3 subnets across availability zones to take advantage of EC2 instance availability and to ensure high availability of infrastructure.

Using a Specific AMI for Compute Environments

Some organizations restrict the use of AMIs to a pre-approved list. By default, Amazon Genomics CLI uses the most recent version of the Amazon Linux 2 ECS Optimized AMI. To change this behavior you can supply the ID of an alternative AMI at account activation. This AMI will then be used for all compute environments used by all newly deployed contexts.

agc account activate --ami <ami-id>

There are some specific requirements that the AMI must comply with. It must be a private AMI from the same account that you will use for deploying Amazon Genomics CLI infrastructure. It must also be capable of successfully running all parts of the LaunchTemplate executed at startup time including the ecs-additions dependencies. We recommend an ECS optimized image based on Amazon Linux 2, RHEL, Fedora or similar.

If the LaunchTemplate cannot complete successfully it will result in an EC2 instance that cannot join a compute-cluster and cannot complete workflow tasks. A common symptom of this is workflow tasks that become stuck in a “runnable” state but are never assigned to a cluster node.
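If you need a baseline for comparison, the ID of the most recent Amazon Linux 2 ECS Optimized AMI in a region can be looked up from the public SSM parameter published by AWS (a sketch; the region is a placeholder):

aws ssm get-parameters \
    --region us-east-1 \
    --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id \
    --query 'Parameters[0].Value' --output text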

Using Only Public Subnets

Amazon Genomics CLI can create a new VPC with only public subnets to use for its infrastructure using the --usePublicSubnets flag.

agc account activate --usePublicSubnets

This can reduce costs by removing the need for NAT Gateways and VPC Gateway Endpoints to route internet traffic from private subnets. It can also reduce the number of Elastic IP Addresses consumed by your infrastructure. If you activate the account this way, any context you deploy must also set usePublicSubnets: true in its definition in the agc-project.yaml file, for example:

contexts:
  myContext:
    usePublicSubnets: true
    engines:
      - type: nextflow
        engine: nextflow

Security Considerations

Although your infrastructure will be protected by security groups you should be aware that any manual modification of these may result in exposing your infrastructure to the internet. For this reason we do not recommend using this configuration in production or with sensitive data.

Updating

Issuing account activate commands more than once effectively updates the core infrastructure with the difference between the two commands according to the rules below.

Updating the VPC

You may change the VPC used by issuing the command agc account activate --vpc <vpc-id>. If a --vpc argument is not provided as part of an agc account activate command then the last VPC used will be ‘remembered’ and used by default.

If you wish to change to use a new default VPC created by Amazon Genomics CLI you must deactivate (agc account deactivate) and reactivate with no --vpc flag.

agc account activate               # VPC 1 created.
agc account activate --vpc abc     # VPC 1 destroyed and customer VPC abc used.
agc account activate               # VPC 2 created. Customer VPC retained.
agc account deactivate             # AGC core infrastructure destroyed. Customer VPC abc retained.

Updating to Use Public Subnets Only

If you wish to change the VPC to use public subnets only, or change it from public subnets to private subnets you must deactivate the account and reactivate it with (or without) the --usePublicSubnets flag. For example:

agc account activate --usePublicSubnets # New VPC with only public subnets
agc account deactivate                  # VPC destroyed
agc account activate                    # New VPC with private subnets

Updating Selected Subnets

To change a VPC to use a different selection of subnets you must supply both the VPC id and the required subnet IDs. If you omit the --subnets flag, then future context deployments will use all private subnets of the VPC.

agc account activate --vpc <vpc-id> --subnets <subnet1,subnet2> # use subnets 1 and 2 of vpc-id 
agc account activate --vpc <vpc-id> --subnets <subnet1,subnet4> # use subnets 1 and 4 of vpc-id
agc account activate --vpc <vpc-id>                             # use all subnets of vpc-id

Updating the Compute-Environment AMI

The compute-environment AMI can be changed by re-issuing the account activate command with (or without) the --ami flag. If the flag is not provided the latest Amazon Linux 2 ECS optimized image will be used.

agc account activate                    # Latest Amazon Linux ECS Optimized AMI used for all contexts
agc account activate --ami <ami-1234>   # AMI 1234 used for new contexts
agc account activate                    # Latest Amazon Linux ECS Optimized AMI used for new contexts

deactivate

The deactivate command is used to remove the core infrastructure deployed by Amazon Genomics CLI in the current region when an account is activated. The S3 bucket deployed by Amazon Genomics CLI and its contents are retained. If a VPC and/ or S3 bucket were specified by the user during account activation these will also be retained. Any CloudWatch logs produced by Amazon Genomics CLI will also be retained.

If there are existing deployed contexts the command will fail; however, you can force their removal at the same time with the --force flag. Note that this will also interrupt any running workflow of any user in that region.

The deactivate command will only operate on infrastructure in the current region.

If the deployed infrastructure has been modified through the console or the AWS CLI, rather than through Amazon Genomics CLI, deactivation may fail due to the infrastructure state being inconsistent with the CloudFormation state. If this happens you may need to manually clean up through the CloudFormation console.

Costs

Core infrastructure deployed for Amazon Genomics CLI is tagged with the application-name: agc tag. This tag can be activated for cost tracking in AWS CostExplorer. The core infrastructure is shared and not tagged with any username, context name or project name.
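For example, assuming the application-name tag has been activated as a cost allocation tag in your account, the associated costs could be queried with the AWS CLI along these lines (the dates are placeholders):

aws ce get-cost-and-usage \
    --time-period Start=2022-01-01,End=2022-02-01 \
    --granularity MONTHLY \
    --metrics UnblendedCost \
    --filter '{"Tags": {"Key": "application-name", "Values": ["agc"]}}'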

While an account region is activated there will be ongoing charges from the core infrastructure deployed including things such as VPC NAT gateways and VPC Endpoints. If you no longer use Amazon Genomics CLI in a region we recommend you deactivate it. You may also wish to remove the S3 bucket along with its objects as well as the CloudWatch logs produced by Amazon Genomics CLI. These are retained by default so that you can view workflow results and logs even after deactivation.

However, if you wish to have this infrastructure remain deployed, you can significantly reduce ongoing costs by using agc account activate --usePublicSubnets. This prevents the creation of private subnets with NAT gateways, and the use of VPC endpoints, both of which have associated ongoing costs. Please note that you must also set usePublicSubnets: true in your agc-project.yaml if you choose to use this option. Please also note that this is not recommended for security-critical deployments, as any edits to the stack security groups risk exposing worker nodes to the public internet.

Network traffic

When running genomics workflows, network traffic can become a significant expense when the traffic is routed through NAT gateways into private subnets (where worker nodes are usually located). To minimize these costs we recommend the use of VPC Endpoints (see below) as well as activating Amazon Genomics CLI and running your workflows in the same region as the S3 bucket holding your genome files. VPC Gateway endpoints are regional, so cross-region S3 access will not be routed through a VPC gateway endpoint.

If you make use of large container images in your workflows (such as the GATK images) we recommend copying these to a private ECR repository in the same region that you will run your analysis to use ECR endpoints and avoid traffic through NAT gateways.

VPC Endpoints

When Amazon Genomics CLI creates a VPC it creates the following VPC endpoints:

  • com.amazonaws.{region}.dynamodb
  • com.amazonaws.{region}.ecr.api
  • com.amazonaws.{region}.ecr.dkr
  • com.amazonaws.{region}.ecs
  • com.amazonaws.{region}.ecs-agent
  • com.amazonaws.{region}.ecs-telemetry
  • com.amazonaws.{region}.logs
  • com.amazonaws.{region}.s3
  • com.amazonaws.{region}.ec2

If you provide your own VPC we recommend that the VPC has these endpoints. This will improve the security posture of Amazon Genomics CLI in your VPC and will also reduce NAT gateway traffic charges which can be substantial for genomics analyses that use large S3 objects and/ or large container images.
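To check which endpoints an existing VPC already has, you can list them with the AWS CLI (a sketch; the VPC ID is a placeholder):

aws ec2 describe-vpc-endpoints \
    --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
    --query 'VpcEndpoints[].ServiceName' --output text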

If you are using Amazon Genomics CLI client on an EC2 instance in a subnet with no access to the internet you will need to have a VPC endpoint to com.amazonaws.{region}.execute-api so that the client can make calls to the REST services deployed during account activation.

Technical Details

Amazon Genomics CLI core infrastructure is defined in code and deployed by AWS CDK. The CDK app responsible for creating the core infrastructure can be found in packages/cdk/apps/core/.

2 - Users

How Amazon Genomics CLI identifies users

When the CLI is set up, the user of the CLI is defined using the agc configure email command. This email should be unique to the individual user. This email address is used to determine a unique user ID which will be used to uniquely identify infrastructure belonging to that user.
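For example:

agc configure email someone@company.com   # sets the user email for this environment
agc configure describe                    # shows the configured email and the derived username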

Amazon Genomics CLI Users are Not IAM Users (or Principals)

Amazon Genomics CLI users are primarily used for identification and as a component of namespacing. They are not a security measure, nor are they related to IAM users or roles. All AWS activities carried out by Amazon Genomics CLI will be done using the AWS credentials in the environment where the CLI is installed and are not based on the Amazon Genomics CLI username.

For example, if Amazon Genomics CLI is installed on an EC2 instance and configured with the email someone@company.com, Amazon Genomics CLI will interact with AWS resources based solely on the IAM Role assigned to that EC2 instance via its instance profile. Likewise, if you use Amazon Genomics CLI on your laptop then the IAM role that you use will be determined by the same process as is used by the AWS CLI.

Who am I?

To find out what username and email has been configured in your current environment you can use the following command:

agc configure describe

Changing user

If you update your configured email, a new user identity is generated. If this is done while infrastructure is deployed, Amazon Genomics CLI may no longer be able to identify that infrastructure as belonging to your project. We strongly recommend stopping all running workflows and destroying all your deployed contexts from all projects before changing user. If you do not do this, you or an account administrator will need to identify any un-needed infrastructure in the CloudFormation console and remove it from there.

3 - Projects

A project defines the contexts, engines, data and workflows that make up a genomics analysis

An Amazon Genomics CLI project defines the contexts, engines, data and workflows that make up a genomics analysis. Each project is defined in a project file named agc-project.yaml.

Project File Location

To find the project definition, Amazon Genomics CLI will look for a file named agc-project.yaml in the current working directory. If the file is not found, Amazon Genomics CLI will traverse up the file hierarchy until the file is found or until the root of the file system is reached. If no project definition can be found an error will be reported. All Amazon Genomics CLI commands operate on the project identified by the above process.

Consider the example directory structure below:

/
├── baa/
│   ├── a/
│   └── agc-project.yaml
├── foo/
└── foz/
    └── a/
        ├── agc-project.yaml
        └── b/
            └── c/
                └── agc-project.yaml
  • If the current working directory is /baa or /baa/a then /baa/agc-project.yaml will be used for definitions,
  • If the current working directory is /foo an error will be reported as no project file is found before the root,
  • If the current working directory is /foz an error will be reported as no project file is found before the root,
  • If the current working directory is /foz/a or /foz/a/b then /foz/a/agc-project.yaml will be used for definitions.
  • If the current working directory is /foz/a/b/c then /foz/a/b/c/agc-project.yaml will be used for definitions.

Relative Locations

The locations of resources declared in a project file are resolved relative to the location of the project file unless they are declared using an absolute path. If the project file in /baa declares that there is a workflow definition in a/b/ then Amazon Genomics CLI will search for that definition in /baa/a/b/.

Project File Structure

A minimal project file can be generated using the command agc project init myProject --workflow-type nextflow. Using myProject as the project name and nextflow as the workflow type will result in the following:

name: myProject
schemaVersion: 1
contexts:
  ctx1:
    engines:
      - type: nextflow
        engine: nextflow

This is a fully usable project called “myProject” with a single context named “ctx1”. At this point “ctx1” can be deployed; however, there are currently no workflows defined.

name

A string that identifies the project.

schemaVersion

An integer defining the schema version. Version numbers will be incremented when changes are made to the project schema that are not backward compatible.

contexts

A map of context names to context definitions. Each context in the project must have a unique name. The contexts documentation provides more details.

workflows

A map of workflow names to workflow definitions. Workflow names must be unique in a project. The workflows documentation provides more details.

data

An array of data sources that the contexts of the project have access to. For example:

data:
  - location: s3://gatk-test-data
    readOnly: true
  - location: s3://broad-references
    readOnly: true
  - location: s3://1000genomes-dragen-3.7.6
    readOnly: true

You can use S3 prefixes to be more restrictive about access to data. For example, if you want to allow access to the foo folder of my-bucket and its sub-folders you would declare the location as:

data:
  - location: s3://my-bucket/foo/*

You can also grant access to a specific object (only) by providing the full path of the object. For example:

data:
  - location: s3://my-bucket/foo/object

Commands

A full reference of project commands is available here

init

The agc project init <project-name> --workflow-type <workflow-type> command can be used to initialize a minimal agc-project.yaml file in the current directory. Alternatively, project YAML files can be created with any text editor.

describe

The agc project describe <project-name> command will provide basic metadata about the ‘local’ project file. See above for details on how project files are located.

validate

Using agc project validate you can quickly identify any syntax errors in your local project file.

Versioning and Sharing

We recommend placing a project under source version control using a tool like Git. The folder containing the agc-project.yaml file is a natural location for the root of a Git repository. Workflows relating to the project would naturally be located in sub-folders of the same repository allowing those to be versioned as well. Alternatively, more advanced Git users may consider storing workflows as a Git sub-module allowing them to be independent of the project and reused among projects.

Projects and associated workflows can then be shared by “pushing” the project’s Git repository to a website such as GitHub, GitLab, or BitBucket or hosted on a private Git Server like AWS Code Commit. To facilitate sharing you should ensure that any file paths in your definitions are relative to the project and not absolute. You will also need to make sure that data locations are appropriately shared.

Costs

A project itself doesn’t have infrastructure. It is not deployed and therefore has no direct costs. If the contexts defined by a project are deployed, or its workflows run, then those will incur costs.

Tags

The project name will be tagged on any deployed contexts or workflows defined in this project allowing costs to be aggregated to the project level.

Technical Details

A project is purely a YAML definition. The values in the agc-project.yaml file are used by CDK when Amazon Genomics CLI deploys contexts and when Amazon Genomics CLI runs workflows. The project itself has no direct infrastructure. The project name is used to help namespace context infrastructure.

4 - Contexts

Contexts are the set of cloud resources used to run a workflow

What is a Context?

A context is a set of cloud resources. Amazon Genomics CLI runs workflows in a context. A deployed context will include an engine that can interpret and manage the running of a workflow along with compute resources that will run the individual tasks of the workflow. The deployed context will also contain any resources needed by the engine or compute resources including any security, permissions and logging capabilities. Deployed contexts are namespaced based on the user, project and context name so that resources are isolated, preventing collisions.

When a workflow is run the user will decide which context will run it. For example, you might choose to submit a workflow to a context that uses “Spot priced” resources or one that uses “On Demand” priced resources.

When a context is deployed, any of its resources that require a VPC will be deployed into the VPC that was specified when the account was activated.

How is a Context Defined?

A context is defined in the YAML file that defines the project. A project has at least one context but may have many. Contexts must have unique names and are defined as YAML maps.

A context may request use of Spot priced compute resources with requestSpotInstances: true. The default value is false.

A context must define an array of one or more engines. Each engine definition must specify the workflow language that it will interpret. For each language, Amazon Genomics CLI has a default engine; however, users may specify the exact engine in the engine parameter.

General Architecture of a Context

The exact architecture of a context will depend on the context properties described below and defined in their agc-project.yaml. However, the architecture deployed on execution of agc context deploy is shown in the following diagram:

Image of the general architecture of a context

Context Properties

Instance Types

You may optionally specify the instance types to be used in a context. This can be a specific type such as r5.2xlarge, an instance family such as c5, or a combination. By default, a context will use instance types up to 4xlarge.

Note: if you only specify large instance types, they will be used to run even the smallest tasks, so we recommend including smaller types as well.

Ensure that any custom types you list are available in the region that you’re using with Amazon Genomics CLI or the context will fail to deploy. You can obtain a list using the following command:

aws ec2 describe-instance-type-offerings \
    --region <region_name>
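For example, to check whether a specific instance type is offered in a region (a sketch; the region and type shown are placeholders):

aws ec2 describe-instance-type-offerings \
    --region us-east-1 \
    --filters Name=instance-type,Values=r5.2xlarge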

Examples

The following snippet defines two contexts, one that uses spot resources and one that uses on demand. Both contain a WDL engine.

...
contexts:
  # The on demand context uses on demand EC2 instances which may be more expensive but will not be interrupted
  onDemandCtx:
    requestSpotInstances: false
    engines:
      - type: wdl
        engine: cromwell
  # The spot context uses EC2 spot instances which are usually cheaper but may be interrupted
  spotCtx:
    requestSpotInstances: true
    engines:
      - type: wdl
        engine: cromwell
...

The following context may use any instance type from the m5, c5 or r5 families

contexts:
  nfLargeCtx:
    instanceTypes: [ "c5", "m5", "r5" ]
    engines:
      - type: nextflow
        engine: nextflow

Max vCpus

default: 256

You may optionally specify the maximum number of vCPUs used in a context. This is the maximum total number of vCPUs across all jobs running concurrently within a context. When the max has been reached, additional jobs will be queued.

Note: if your account’s vCPU service limit is lower than maxVCpus, you will not get as many vCPUs as requested and will need to request a limit increase.

contexts:
  largeCtx:
    maxVCpus: 2000
    engines:
      - type: nextflow
        engine: nextflow

Public Subnets

In the interest of saving money, in particular if you intend to have the AGC stack deployed for a long period, you may choose to deploy in “public subnet” mode. To do this, you must first set up the core stack using agc account activate --usePublicSubnets, which disables the creation of the NAT gateway and VPC endpoints that present an ongoing cost unrelated to your use of compute resources. After you have done this, you must also set usePublicSubnets: true in all contexts you use:

contexts:
  someCtx:
    usePublicSubnets: true
    engines:
      - type: nextflow
        engine: nextflow

This ensures that the AWS Batch instances are deployed into a public subnet, which has no additional cost associated with it. Note, however, that while these instances are given a security group that blocks all incoming traffic, this is not as secure as the default private subnet mode.

Context Commands

A full reference of context commands is here

describe

The command agc context describe <context-name> [flags] will describe the named context as defined in the project YAML as well as other relevant account information.

list

The command agc context list [flags] will list the names of all contexts defined in the project YAML file along with the name of the engine used by the context.

deploy

The command agc context deploy <context-name> [flags] is used to deploy the cloud infrastructure required by the context. If the context is already running the existing infrastructure will be updated to reflect changes in project YAML. For example if you added another data definition in your project and run agc context deploy <context-name> then the deployed context will be updated to allow access to the new data.

All contexts defined in the project YAML can be deployed or updated using the --all flag.

Individually named contexts can be deployed or updated as positional arguments. For example: agc context deploy ctx1 ctx2 will deploy the contexts ctx1 and ctx2.

The inclusion of the --verbose flag will show the full CloudFormation output of the context deployment.
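For example:

agc context deploy ctx1 ctx2        # deploy or update the named contexts
agc context deploy --all            # deploy or update every context defined in the project
agc context deploy ctx1 --verbose   # show the full CloudFormation output while deploying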

destroy

A context’s cloud resources can be “destroyed” using the agc context destroy <context-name> command. This will remove any infrastructure artifacts associated with the context unless they are defined as being retained. Typically, things like logs and workflow outputs on S3 are retained when a context is destroyed.

All deployed contexts can be destroyed using the --all flag.

Multiple contexts can be destroyed in a single command using positional arguments. For example: agc context destroy ctx1 ctx2 will destroy the contexts ctx1 and ctx2.

status

The status command is used to determine the status of a deployed context or context instance. This can be useful to determine if an instance of a particular context is already deployed. It can be used to determine if the deployed context is consistent with the defined context in the project YAML file. For example, if you deploy a context instance and later change the definition of the context in the project YAML file then the running instance will no longer reflect the definition. In this case you may choose to update the deployed instance using the agc context deploy command.

Status will only be shown for contexts for the current user in the current AWS region for the current project. To show contexts for another project, issue the command from that project’s home folder (or a subfolder). To display contexts for another AWS region, you can use a different AWS CLI profile or set the AWS_REGION environment variable to the desired region (e.g. export AWS_REGION=us-west-2).

Costs

Infrastructure deployed for a context is tagged with the context name as well as username and project name. These tags can be used with AWS CostExplorer to identify the costs associated with running contexts.

A deployed context will incur charges based on the resources being used by the context. If a workflow is running this will include compute costs for running the workflow tasks but some contexts may include infrastructure that is always “on” and will incur costs even when no workflow is running. If you no longer need a context we recommend pausing or destroying it.

If requestSpotInstances is true, the context will use spot instances for compute tasks. The context will set the max price to 100% although if the current price is lower you will pay the lower price. Note that even at 100% spot instances can still be interrupted if total demand for on demand instances in an availability zone exceeds the available pool. For full details see Spot Instance Interruptions and EC2 Spot Pricing.

Ongoing Costs

Until a context is destroyed resources that are deployed can incur ongoing costs even if a workflow is not running. The exact costs depend on the configuration of the context.

Amazon Genomics CLI version 1.0.1 and earlier used an AWS Fargate based WES service for each deployed context. The service uses 0.5 vCPU, 4 GB memory and 20 GB base instance storage. Fargate pricing varies by region and is detailed here. The estimated cost is available via this link

After version 1.0.1, the WES endpoints deployed by Amazon Genomics CLI are implemented with AWS Lambda and therefore use a pricing model based on invocations.

Contexts using a Cromwell engine run an additional AWS Fargate service for the engine with 2 vCPU, 16 GB RAM and 20 GB of base storage. Additionally, Cromwell is deployed with a standard EFS volume for storage of metadata. EFS costs are volume based. While relatively small the amount of metadata will expand as more workflows are run. The volume is destroyed when the context is destroyed. An estimated cost for both components is available via this link

Contexts using the “miniwdl” or “snakemake” engines use EFS volumes as scratch space for workflow intermediates, caches and temporary files. Because many genomics workflows can accumulate several GB of intermediates per run we recommend destroying these contexts when not in use. An estimated cost assuming a total of 500 GB of workflow artifacts is available via this link

Refer to the public subnets section if you are concerned about reducing these ongoing costs.

Tags

All context infrastructure is tagged with the context name, username and project name. These tags may be used to help differentiate costs.

Technical Details

Context infrastructure is defined as code as AWS CDK apps. For examples, take a look at the packages/cdk folder. When deployed, a context will produce one or more stacks in CloudFormation. Details can be viewed in the CloudFormation console or with the AWS CLI.

A context includes an endpoint compliant with the GA4GH WES API. This API is how Amazon Genomics CLI submits workflows to the context. The context also contains one or more workflow engines. These may either be deployed as long-running services, as is the case with Cromwell, or as “head” jobs that are responsible for a single workflow, as is the case for Nextflow. Engines run as “head” jobs are started and stopped on demand, thereby saving resources.

Updating Launch Templates

Changes to EC2 LaunchTemplates in CDK result in a new LaunchTemplate version when the infrastructure is updated. Currently, CDK is unable to also update the default version of the template. In addition, any existing AWS Batch Compute Environments will not be updated to use the new LaunchTemplate version. Because of this, whenever a LaunchTemplate is updated in CDK code we recommend destroying any relevant running contexts and redeploying them. An update will NOT be sufficient.

5 - Data

Data sets

To run an analysis you need data. In the agc-project.yaml file of an Amazon Genomics CLI project data is a list of data locations which can be used by the contexts of the project.

In the example data definition below we are declaring that the project’s contexts will be allowed to access the three listed S3 bucket URIs.

data:
  - location: s3://gatk-test-data
    readOnly: true
  - location: s3://broad-references
    readOnly: true
  - location: s3://1000genomes-dragen-3.7.6
    readOnly: true

The contexts of the project will be denied access to all other S3 locations except for the S3 bucket created or associated when the account was activated by Amazon Genomics CLI.

Declaring access in the project will only ensure your infrastructure is correctly configured to access the bucket. If the target bucket is further restricted, such as by an access control list or bucket policy, you will still be denied access. In these cases you should work with the bucket owner to facilitate access.

Read and Write

The default value of readOnly is true. At the time of writing, write access is not supported (except for the Amazon Genomics CLI core S3 bucket).

Access to a Prefix

The above examples will grant read access to an entire bucket. You can grant more granular access to a prefix within a bucket, for example:

data:
  - location: s3://my-bucket/my/prefix/

Cross Account Access

A bucket in another AWS account can be accessed if the owner has set up access, and you are using a role that is allowed access. See cross account access for details.

Updating Data Sources

If data definitions are added to or removed from a project definition the change will not be reflected in deployed contexts until they are updated. This can be done with agc context deploy --all for all contexts or by using a context name to update only one. See context deploy for details.

Technical Details

When a context is deployed, IAM roles used by the infrastructure of the context will be granted permissions to perform S3 read (or read and write) actions on the listed locations. The permissions are defined in CDK code in /packages/cdk/apps/. The CDK code does not modify any data in the buckets or any other bucket policies or configurations.

6 - Workflows

A Workflow is a series of steps or tasks to be executed as part of an analysis.

A Workflow is a series of steps or tasks to be executed as part of an analysis. To run a workflow using Amazon Genomics CLI, first you must have deployed a context with suitable compute resources and with a workflow engine that can interpret the language of the workflow.

Specification in Project YAML

In an Amazon Genomics CLI project you can specify multiple workflows in a YAML map. The following example defines three WDL version 1.0 workflows. The sourceURL property defines the location of the workflow file. If the location is relative then it is resolved relative to the location of the project YAML file. Absolute file locations are also possible although this may reduce the portability of a project if it is intended to be shared. Web URLs are supported as locations of the workflow definition file.

At this time Amazon Genomics CLI does not resolve path aliases so, for example, a sourceURL like ~/workflows/workflow.wdl is not supported.

The type object declares the language of the workflow (e.g. wdl, nextflow). To run a workflow there must be a deployed context with an engine matching that language. The version property refers to the workflow language version.

workflows:
  hello:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/hello.wdl
  read:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/read.wdl
  words-with-vowels:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/words-with-vowels.wdl

Multi-file Workflows

Some workflow languages allow for the import of other workflows. To accommodate this, Amazon Genomics CLI supports using a directory as a source URL. When a directory is supplied as the sourceURL, Amazon Genomics CLI uses the following rules to determine the name of the main workflow file and any supporting files:

  1. If the source URL resolves to a single non-zipped file, then the file is assumed to be a workflow file. Dependent resources (if any) are hardcoded in the file and must be resolvable by the WesAdapter or, implicitly, the workflow engine (i.e. the WesAdapter determines whether the engine can resolve them and, if not, resolves them itself).
  2. The source URL resolves to a zipped file (.zip). The zip may contain a manifest.
    1. If the zip file does not contain a file named MANIFEST.json:
      1. The zip file must contain one workflow file with the prefix main followed by the conventional suffix for the workflow, e.g. main.wdl
      2. Any sub-workflows or tasks referenced by the main workflow must either be in the zip at the appropriate relative path or they must be referenced by URLs that are resolvable by the workflow engine. The WesAdapter may attempt to resolve them for the engine but this is a convenience and not required.
      3. Any variables not defined in the workflows must be provided in an inputs file that is referenced via the input argument in AGC. For workflow engines that support multiple input files an index suffix must be provided (e.g. inputs_a.json or inputs_1.json) if there is more than one inputs file.
      4. A workflow options file may be included and must be named with the options prefix followed by the conventional suffix of the workflow. The WesAdapter may choose to make use of this depending on the context of the workflow engine. It may also choose to pass this to the workflow engine or pass a modified copy to the workflow engine.
    2. If the zip file does contain a manifest:
      1. The manifest must contain a parameter called mainWorkflowURL whose value must either be a URL, including the relevant protocol, or the name of a file present in the zip archive. Any subworkflows or tasks imported by the main workflow must either be referenced as URLs in the workflow or be present in the archive as described above.
      2. The manifest may contain an array of URLs to inputs files called inputFileURLs. The WesAdapter must decide if it should resolve these or let the workflow engine resolve them.
      3. The manifest may contain a URL reference to an options file named optionFileURL. The WesAdapter may choose to make use of this depending on the context of the workflow engine. It may also choose to pass this to the workflow engine or pass a modified copy to the workflow engine.
  3. If the source URL points to a directory then Amazon Genomics CLI will zip the directory before uploading it. The directory must follow the same conventions stated above for zip files.

The following snippet demonstrates a possible declaration of a multi-file workflow:

workflows:
  gatk4-data-processing:
    type:
      language: wdl
      version: 1.0
    sourceURL: ./gatk4-data-processing

The following snippet demonstrates a valid MANIFEST.json file:

{
  "mainWorkflowURL": "processing-for-variant-discovery-gatk4.wdl",
  "inputFileURLs": [
    "processing-for-variant-discovery-gatk4.hg38.wgs.inputs.json"
  ],
  "optionFileURL": "options.json"
}

MANIFEST.json Structure

The following keys are allowed in the MANIFEST.json

mainWorkflowURL (required)
Points to the workflow definition that is the main entrypoint of the workflow.

inputFileURLs (optional)
An array of URLs to one or more JSON files that define the inputs to the workflow in the format expected by the relevant engine. Input file URLs are resolved relative to the location of the MANIFEST.json. If multiple files are listed in inputFileURLs, they will be passed to Cromwell in the order specified as workflowInput.json, workflowInput_1.json, … , workflowInput_5.json (note: Cromwell only supports up to 5 input JSON files). If there are any properties in common between the files, values in higher numbered files take precedence. See: Cromwell Docs

optionFileURL (optional)
A URL pointing to a JSON file containing engine options applied to a workflow instance. This is only used when engines run in server mode. Options are interpreted by the engine and so must be in the form expected by the engine. The URL is resolved relative to the location of the MANIFEST.json.

engineOptions (optional)
A string appended to the command line of the engine’s run command. The string may contain any flags or parameters relevant to the engine of the context used to run the workflow. It should not be used to declare inputs (use inputFileURLs instead). This parameter is only relevant for engines that run as head processes.
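As an illustration, a minimal MANIFEST.json for a head-process engine that uses engineOptions might look like the following. The file names are placeholders and -with-trace is simply a Nextflow reporting flag used as an example; any flag relevant to your engine could be used.

{
  "mainWorkflowURL": "main.nf",
  "inputFileURLs": [
    "inputs.json"
  ],
  "engineOptions": "-with-trace"
}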

Engine Selection

When a workflow is submitted to run, Amazon Genomics CLI will match the workflow type with the map of engines in the context. For example, if the workflow type is wdl Amazon Genomics CLI will attempt to identify an engine designated as the engine for that type. There may only be one engine per type. If no suitable engine is found in the context an error will be reported.

Workflow Instances

Any defined project workflow can be run multiple times. Each run is called an instance and assigned a unique instance ID. When referring to a specific run of a workflow you should use the instance ID rather than the workflow name. It is possible to submit multiple instances of the same workflow and to have these run concurrently.

Context

All workflows are coordinated by an engine; they are submitted to and executed in the context that is specified at submission time. The workflow engine decides how the workflow is to be run and the context provides the compute resources to run the workflow.

Commands

A full reference of workflow commands is available here

run

Invoking agc workflow run <workflow-name> -c <context-name> will run the named workflow in a specific context. The unique ID of that workflow instance run will be returned if the submission is successful.

workflow arguments

Workflow arguments, such as an inputs file, can be specified at submission time using the -i or --inputsFile flag. For example:

agc workflow run my-workflow --inputsFile inputs.json

If the inputs file references local files, these will be synced with S3 and those files in S3 will be used when the workflow instance is run.
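The content of the inputs file depends entirely on the workflow and the engine. As a purely hypothetical sketch for a WDL workflow named hello that takes a String input and a File input, the file might look like the following; the local path would be synced to S3 as described above:

{
  "hello.name": "world",
  "hello.input_file": "data/sample.txt"
}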

workflow optionFileUrl

An additional option file can be provided using the -o or --optionFileUrl flag. For example:

agc workflow run my-workflow --optionFileUrl optionFile.json

OptionFileUrl is only for use with engines that run in server mode (e.g. Cromwell).

Example option.json

{
    "option_name_1": "option value 1",
    "option_name_2": "option value 2"
}

list

The agc workflow list command can be used to list all workflows that are specified in the current project.

describe

The agc workflow describe <workflow-name> command will return detailed information about the named workflow based on the specification in the current project YAML file.

status

To find out the status of workflow instances that are running, or have been run, you can use the agc workflow status command. This will display details of up to 20 recent workflow instances from the project; to display more, or fewer, you can use the --limit flag with a number of up to 1000.

To list the status of workflows run or running in a specific context use the --context-name flag and provide the name of one of the contexts of the project.

You may get the status of workflow instances by workflow name using the --workflow-name flag.

To display the status of a specific workflow instance you can provide the id of the desired workflow instance with the --instance-id flag.
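For example:

agc workflow status                                # up to 20 recent workflow instances for the project
agc workflow status --limit 100                    # show up to 100 instances
agc workflow status --context-name spotCtx         # instances run in the spotCtx context
agc workflow status --workflow-name hello          # instances of the hello workflow
agc workflow status --instance-id <instance-id>    # a single workflow instance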

stop

A running workflow instance can be stopped at any time using the agc workflow stop <instance-id> command. When issued, Amazon Genomics CLI will look up the appropriate context and engine using the instance-id of the workflow and instruct the engine to stop the workflow. What happens next depends on the actual workflow engine. For example, in the case of the Cromwell WDL engine, any currently executing tasks will halt, any pending tasks will be removed from the work queue and no further tasks will be started for that workflow instance.

output

You can obtain the output (if any) of a completed workflow run using the output command and supplying the workflow run id. Typically, this is useful for locating the files produced by a workflow, although the actual output generated depends on the workflow specification and engine.

If the workflow declares outputs you may also obtain these using the command:

agc workflow output <workflow_run_id>

The following is an example of output from the “CramToBam” workflow run in a context using the Cromwell engine.

OUTPUT	id	aaba95e8-7512-48c3-9a61-1fd837ff6099
OUTPUT	outputs.CramToBamFlow.outputBai	s3://agc-123456789012-us-east-1/project/GATK/userid/mrschre4GqyMA/context/spotCtx/cromwell-execution/CramToBamFlow/aaba95e8-7512-48c3-9a61-1fd837ff6099/call-CramToBamTask/NA12878.bai
OUTPUT	outputs.CramToBamFlow.outputBam	s3://agc-123456789012-us-east-1/project/GATK/userid/mrschre4GqyMA/context/spotCtx/cromwell-execution/CramToBamFlow/aaba95e8-7512-48c3-9a61-1fd837ff6099/call-CramToBamTask/NA12878.bam
OUTPUT	outputs.CramToBamFlow.validation_report	s3://agc-123456789012-us-east-1/project/GATK/userid/mrschre4GqyMA/context/spotCtx/cromwell-execution/CramToBamFlow/aaba95e8-7512-48c3-9a61-1fd837ff6099/call-ValidateSamFile/NA12878.validation_report

Cost

Your account will be charged based on actual resource usage including compute time, storage and data transfer charges. The resources used will depend on the resources requested in your workflow definition as interpreted by the workflow engine according to the resources made available in the context in which the workflow is run. If a spot context is used then the costs of the spot instances will be determined by the rules governing spot instance charges.

Tags

Resources used by Amazon Genomics CLI are tagged including the username, project name and the context name. Currently, tagging is not possible at the level of an individual workflow.

7 - Engines

Workflow engines parse and manage the tasks in a workflow

A workflow engine is defined as part of a context. A context is currently limited to one workflow engine. The workflow engine will manage the execution of any workflows submitted by Amazon Genomics CLI. When the context is deployed, an endpoint will be made available to Amazon Genomics CLI through which it will submit workflows and workflow commands to the engine according to the WES API specification.

Supported Engines and Workflow Languages

Currently, Amazon Genomics CLI’s officially supported engines can be used to run the following workflows:

Engine      Language        Language Versions         Run Mode
Cromwell    WDL             All versions up to 1.0    Server
Nextflow    Nextflow DSL    Standard and DSL 2        Head Process
miniwdl     WDL             documented here           Head Process
Snakemake   Snakemake       All versions              Head Process
Toil        CWL             All versions up to 1.2    Server

Over time we plan to add support for additional engines and languages and to provide the ability for third-party developers to develop engine plugins.

Run Mode

Server

In server mode the engine runs as a long-running process that exists for the lifetime of the context. All workflow instances sent to the context are handled by that server. The server resides on on-demand instances to prevent Spot interruption even if the workflow tasks are run on Spot instances.

Head Process

Head process engines are run when a workflow is submitted, manage a single workflow and only run for the lifetime of the workflow. If multiple workflows are submitted to a context in parallel then multiple head processes are spawned. The head processes always run on on-demand resources to prevent Spot interruption even if the workflow tasks are run on Spot instances.

Engine Definition

An engine is defined within a context definition of the project YAML file as a map. For example, the following snippet defines a WDL engine of type cromwell as part of the context named onDemandCtx. There may be one engine defined for each supported language.

contexts:
  onDemandCtx:
    requestSpotInstances: false
    engines:
      - type: wdl
        engine: cromwell

Commands

There are no commands specific to engines. Engines are deployed along with contexts by the context commands and workflows are run using the workflow commands.

Costs

The costs associated with an engine depend on the actual infrastructure required by the engine. In the case of Cromwell, the engine runs in “server” mode as an AWS ECS Fargate container using an Amazon Elastic File System for metadata storage. The container will be running for the entire time the context is deployed, even when no workflows are running. To avoid this cost we recommend destroying the context when it is not needed. The Nextflow engine runs as a single AWS Batch job per workflow instance and is only running while workflows are running.

In both cases a serverless WES API endpoint is deployed through Amazon API Gateway to act as the interface between Amazon Genomics CLI and the engine.

Tags

Being part of a context, engine related infrastructure is tagged with the context name, username and project name. These tags may be used to help differentiate costs.

Technical Details

Supported engines are currently deployed with configurations that allow them to make use of files in S3 and submit workflows as jobs to AWS Batch. Because the current generation of engines we support do not directly support the WES API, adapters are deployed as Fargate container tasks. AWS API Gateway is used to provide a gateway between Amazon Genomics CLI and the WES adapters.

When workflow commands are issued on Amazon Genomics CLI, it will send WES API calls to the appropriate endpoint. The adapter mapped to that endpoint will then translate the WES command and either send the command to the engine’s REST API (in the case of Cromwell) or spawn a Nextflow engine task and submit the workflow with that task. At this point the engine is responsible for creating, controlling and destroying the resources that will be used for task execution.

8 - Logs

Logs are produced by contexts, engines and workflow tasks. Understanding how to access them is critical to monitoring and debugging workflows.

The infrastructure deployed by Amazon Genomics CLI records logs for many activities including the workflow runs, workflow engines as well as infrastructure. The logs are recorded in CloudWatch but are accessible through the CLI.

When debugging or reviewing a workflow run, the engine logs and workflow logs will be the most useful. For diagnosing infrastructure or access problems the adapter logs and access logs will be informative.

Engine Logs

Engine logs are the logs produced by a workflow engine in a context. The logs produced depend on the engine implementation. Engines that run in “server” mode, such as Cromwell, produce a single log for the lifetime of the context that encompasses all workflows run through that engine. Engines that run as a “head node” will produce discrete engine logs for each run.

Workflow Logs

Workflow logs are the aggregate logs for all steps in a workflow run (instance). Any workflow steps that are retrieved from a call cache are not run, so there will be no workflow logs for these steps. Consulting the engine logs may show details of the call cache. If a previously successful workflow is run with no changes in inputs or parameters, it may have all steps retrieved from the cache; in this case there will be no workflow logs, although the workflow instance will be marked as a success and engine logs will be produced. The outputs for a completely cached workflow will also be available.

Adapter Logs

Adapter logs consist of any logs produced by a WES adapter for a workflow engine. They can reveal information such as the WES API calls that are made to the engine by Amazon Genomics CLI and any errors that may have occurred.

Access Logs

Amazon Genomics CLI talks to an engine via API Gateway which routes to the WES adapter. If an expected call does not appear in the adapter logs it may have been blocked or incorrectly routed in the API Gateway. The API Gateway access logs may be informative in this case.

Commands

A full reference of Amazon Genomics CLI logs commands is available here
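As an illustration, retrieving engine logs for a context and workflow logs for a named workflow follows the same pattern as the other agc commands. The flag names shown here are assumptions, so consult the command reference for the exact usage:

agc logs engine --context-name myContext   # engine logs for the myContext context (flag name assumed)
agc logs workflow hello                    # aggregate task logs for runs of the hello workflow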

Costs

Amazon Genomics CLI logs are stored in CloudWatch and accessed using the CloudWatch APIs. Standard CloudWatch charges apply. All logs are retained permanently, even after a context is destroyed and other Amazon Genomics CLI infrastructure is removed from an account. If they are no longer needed they may be removed using the AWS Console or the AWS CLI.
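For example, a sketch of removing retained log groups with the AWS CLI once they are no longer needed (the log group name is a placeholder; check the CloudWatch console for the exact names used in your account):

aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text   # list all log groups
aws logs delete-log-group --log-group-name <log-group-name>                     # delete one you no longer need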

9 - Namespaces

Amazon Genomics CLI uses namespacing to prevent conflicts

Amazon Genomics CLI uses namespacing to prevent conflicts when there are multiple users, contexts, and projects in the same AWS account and region.

In any given account and region, an individual user may have many projects with many deployed contexts all running at the same time without conflict as long as:

  1. No other user with the same Amazon Genomics CLI username exists in the same account and region.
  2. All projects, used by that user, have a unique name.
  3. All contexts within a project have a unique name.

Shared Project Definitions

Project definitions can be shared between users. A simple way to achieve this is by putting the project YAML file and associated workflow definitions into a source control system like Git. If two users in the same account and region start contexts from the same project definition, these contexts are discrete and include the Amazon Genomics CLI username in the names of their respective infrastructures.

Therefore, the following combinations are allowed:

userA -uses-> ProjectA -to-deploy-> ContextA
userB -uses-> ProjectA -to-deploy-> ContextA

In the above example it is useful to think of these as two instances of Context A. Both share the same definition but the instances do not have the same infrastructure.

Tags

All Amazon Genomics CLI infrastructure is tagged with the application-name key and a value of agc. Aside from the core account infrastructure, all deployed infrastructure is tagged with the following key value pairs:

Key               Value
agc-project       The name of the project in which the context is defined
agc-user-id       The unique username
agc-user-email    The user’s email
agc-context       The name of the context in which the infrastructure is deployed
agc-engine        The name of the engine being run in the context
agc-engine-type   The workflow language run by the engine
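These tags can also be used to locate deployed resources. For example, a sketch using the Resource Groups Tagging API to list everything tagged with a given project name (the project name is a placeholder):

aws resourcegroupstaggingapi get-resources \
    --tag-filters Key=agc-project,Values=myProject \
    --query 'ResourceTagMappingList[].ResourceARN' --output text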