The following are complete worked examples made up of multiple tasks that guide you through relatively simple but realistic scenarios using Amazon Genomics CLI.
Tutorials
1 - Walk through
Prerequisites
Ensure you have completed the prerequisites before beginning.
Download and install Amazon Genomics CLI
Download the Amazon Genomics CLI according to the installation instructions.
Setup
Ensure you have initialized your account and created a username by following the setup instructions.
Initialize a project
Amazon Genomics CLI uses local folders and config files to define projects. Projects contain configuration settings for contexts and workflows (more on these below). To create a new project for running WDL-based workflows, do the following:
mkdir myproject
cd myproject
agc project init myproject --workflow-type wdl
NOTE: for a Nextflow based project you can substitute --workflow-type wdl with --workflow-type nextflow.
Projects may have workflows from different languages, so the --workflow-type flag simply provides the stub for an initial workflow engine.
This will create a config file called agc-project.yaml with the following contents:
name: myproject
schemaVersion: 1
contexts:
ctx1:
engines:
- type: wdl
engine: cromwell
This config file will be used to define aspects of the project - e.g. the contexts and named workflows the project uses. For a more representative project config, look at the projects in ~/agc/examples. Unless otherwise stated, command line activities for the remainder of this document will assume they are run from within the ~/agc/examples/demo-wdl-project/ project folder.
Contexts
Amazon Genomics CLI uses a concept called “contexts” to run workflows. Contexts encapsulate and automate time-consuming tasks like configuring and deploying workflow engines, creating data access policies, and tuning compute clusters for operation at scale.
In the demo-wdl-project folder, run the following:
agc context list
You should see something like:
2021-09-22T01:15:41Z 𝒊 Listing contexts.
CONTEXTNAME cromwell myContext
CONTEXTNAME cromwell spotCtx
In this project there are two contexts, one configured to run with On-Demand instances (myContext), and one configured to use SPOT instances (spotCtx).
You need to have a context running to be able to run workflows. To deploy the context myContext in the demo-wdl-project, run:
agc context deploy myContext
This will take 10-15 minutes to complete.
If you have more than one context configured, and want to deploy them all at once, you can run:
agc context deploy --all
Contexts have read-write access to a context-specific prefix in the S3 bucket Amazon Genomics CLI creates during account activation. You can check this for the myContext context with:
agc context describe myContext
You should see something like:
CONTEXT myContext false STARTED
OUTPUTLOCATION s3://agc-123456789012-us-east-2/project/Demo/userid/xxxxxxxxJKP3z/context/myContext
WESENDPOINT https://a1b2c3d4.execute-api.us-east-2.amazonaws.com/prod/ga4gh/wes/v1
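The describe output is line-oriented (a keyword followed by its values), which makes it easy to pick out a single field in a script. The sketch below extracts the output location; it is simulated against the sample output above, and the field layout is an assumption based on that sample:

```shell
# Parse the context's S3 output location from `agc context describe` output.
# Simulated with the sample output above so the parsing is reproducible.
describe_output='CONTEXT myContext false STARTED
OUTPUTLOCATION s3://agc-123456789012-us-east-2/project/Demo/userid/xxxxxxxxJKP3z/context/myContext
WESENDPOINT https://a1b2c3d4.execute-api.us-east-2.amazonaws.com/prod/ga4gh/wes/v1'

# Keep only the line whose first field is OUTPUTLOCATION, print the value.
OUTPUT_LOCATION=$(printf '%s\n' "$describe_output" | awk '$1 == "OUTPUTLOCATION" {print $2}')
echo "$OUTPUT_LOCATION"
```

In a live session you would pipe `agc context describe myContext` directly into the same awk filter.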
You can add more data locations using the data section of the agc-project.yaml config file. All contexts will have an appropriate access policy created for the data locations listed when they are deployed. For example, the following config adds three public buckets from the Registry of Open Data on AWS:
name: myproject
data:
- location: s3://broad-references
readOnly: true
- location: s3://gatk-test-data
readOnly: true
- location: s3://1000genomes
readOnly: true
Note, you need to redeploy any running contexts to update their access to data locations. Do this by simply (re)running:
agc context deploy myContext
Contexts also define what types of compute your workflow will run on - i.e. if you want to run workflows using SPOT or On-demand instances. By default, contexts use On-demand instances. The configuration for a context that uses SPOT instances looks like the following:
contexts:
# The spot context uses EC2 spot instances which are usually cheaper but may be interrupted
spotCtx:
requestSpotInstances: true
You can also explicitly specify what instance types contexts will be able to use for workflow jobs. By default, Amazon Genomics CLI will use a select set of instance types optimized for running genomics workflow jobs that balance data I/O performance and mitigation of workflow failure due to SPOT reclamation. In short, Amazon Genomics CLI uses AWS Batch for job execution and selects instance types based on the requirements of submitted jobs, up to 4xlarge instance types. If you have a use case that requires a specific set of instance types, you can define them with something like:
contexts:
specialCtx:
instanceTypes:
- c5
- m5
- r5
The above will create a context called specialCtx that will use any size of instances in the C5, M5, and R5 instance families.
Contexts are elastic with a minimum vCPU capacity of 0 and a maximum of 256. When all vCPUs are allocated to jobs, further tasks will be queued.
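If the default ceiling does not suit your workload, recent Amazon Genomics CLI releases document a maxVCpus context property for raising it. The snippet below is a sketch; verify the property name against your installed version's context reference before relying on it:

```yaml
contexts:
  largeCtx:
    # Assumed property: raises this context's maximum vCPU capacity (default 256)
    maxVCpus: 512
```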
Contexts also launch an engine for specific workflow types. You can have one engine per context and, currently, engines for WDL and Nextflow are supported.
Contexts configured with WDL and Nextflow engines, respectively, look like:
contexts:
wdlContext:
engines:
- type: wdl
engine: cromwell
nfContext:
engines:
- type: nextflow
engine: nextflow
Workflows
Add a workflow
Bioinformatics workflows are written in languages like WDL and Nextflow in either single script files, or in packages of multiple files (e.g. when there are multiple related workflows that leverage reusable elements). Currently, Amazon Genomics CLI supports both WDL and Nextflow. To learn more about WDL workflows, we suggest resources like the OpenWDL - Learn WDL course. To learn more about Nextflow workflows, we suggest Nextflow’s documentation and NF-Core.
For clarity, we’ll refer to these workflow script files as “workflow definitions”. A “workflow specification” for Amazon Genomics CLI references a workflow definition and combines it with additional metadata, like the workflow language the definition is written in, which Amazon Genomics CLI will use to execute it on appropriate compute resources.
There is a “hello” workflow definition in the ~/agc/examples/demo-wdl-project/workflows/hello folder that looks like:
version 1.0
workflow hello_agc {
call hello {}
}
task hello {
command { echo "Hello Amazon Genomics CLI!" }
runtime {
docker: "ubuntu:latest"
}
output { String out = read_string( stdout() ) }
}
The workflow specification for this workflow in the project config looks like:
workflows:
hello:
type:
language: wdl
version: 1.0
sourceURL: workflows/hello.wdl
Here the workflow is expected to conform to the WDL-1.0 specification. A specification for a “hello” workflow written in Nextflow DSL1 would look like:
workflows:
hello:
type:
language: nextflow
version: 1.0
sourceURL: workflows/hello
For Nextflow DSL2 workflows set type.version to dsl2.
NOTE: When referring to local workflow definitions, sourceURL must either be a full absolute path or a path relative to the agc-project.yaml file. Path expansion is currently not supported.
You can quickly get a list of available configured workflows with:
agc workflow list
For the demo-wdl-project, this should return something like:
2021-09-02T05:14:47Z 𝒊 Listing workflows.
WORKFLOWNAME hello
WORKFLOWNAME read
WORKFLOWNAME words-with-vowels
The hello workflow specification points to a single-file workflow. Workflows can also be directories. For example, if you have a workflow that looks like:
workflows/hello-dir
|-- inputs.json
`-- main.wdl
The workflow specification for the workflow above would simply point to the parent directory:
workflows:
hello-dir-abs:
type:
language: wdl
version: 1.0
sourceURL: /abspath/to/hello-dir
hello-dir-rel:
type:
language: wdl
version: 1.0
sourceURL: relpath/to/hello-dir
In this case, your workflow must be named main.<workflow-type> - e.g. main.wdl.
You can also provide a MANIFEST.json file that points to a specific workflow file to run. If you have a folder like:
workflows/hello-manifest/
|-- MANIFEST.json
|-- hello.wdl
|-- inputs.json
`-- options.json
The MANIFEST.json file would be:
{
"mainWorkflowURL": "hello.wdl",
"inputFileURLs": [
"inputs.json"
]
}
At minimum, MANIFEST files must have a mainWorkflowURL property, which is a relative path to the workflow file in its parent directory.
Workflows can also be from remote sources like GitHub:
workflows:
remote:
type:
language: wdl
version: 1.0 # this is the WDL spec version
sourceURL: https://raw.githubusercontent.com/openwdl/learn-wdl/master/1_script_examples/1_hello_worlds/1_hello/hello.wdl
NOTE: remote sourceURLs for Nextflow workflows can be Git repo URLs like: https://github.com/nextflow-io/rnaseq-nf.git
Running a workflow
To run a workflow you need a running context. See the section on contexts above if you need to start one. To run the “hello” workflow in the “myContext” context, run:
agc workflow run hello --context myContext
If you have another context in your project, for example one named “test”, you can run the “hello” workflow there with:
agc workflow run hello --context test
If your workflow was successfully submitted you should get something like:
2021-08-04T23:01:37Z 𝒊 Running workflow. Workflow name: 'hello', InputsFile: '', Context: 'myContext'
"06604478-0897-462a-9ad1-47dd5c5717ca"
The last line is the workflow run id. You use this id to reference a specific workflow execution.
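Because the run id is printed as the last line of the command's output, wrapped in quotes, it is easy to capture in a script. The sketch below demonstrates the extraction against a simulated copy of the sample output above; the output format is an assumption based on that sample:

```shell
# Extract the workflow run id: the last line of `agc workflow run` output,
# wrapped in double quotes (format assumed from the sample output above).
run_output='2021-08-04T23:01:37Z Running workflow. Workflow name: hello, Context: myContext
"06604478-0897-462a-9ad1-47dd5c5717ca"'

# Take the last line and strip the surrounding double quotes.
RUN_ID=$(printf '%s\n' "$run_output" | tail -n 1 | tr -d '"')
echo "$RUN_ID"
```

In a live session the same pipeline applies directly, e.g. `RUN_ID=$(agc workflow run hello --context myContext | tail -n 1 | tr -d '"')`.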
Running workflows is an asynchronous process. After submitting a workflow from the CLI, it is handled entirely in the cloud. You can now close your terminal session if needed. The workflow will still continue to run. You can also run multiple workflows at a time. The underlying compute resources will automatically scale. Try running multiple instances of the “hello” workflow at once.
You can check the status of all running workflows with:
agc workflow status
You should see something like this:
WORKFLOWINSTANCE myContext 66826672-778e-449d-8f28-2274d5b09f05 true COMPLETE 2021-09-10T21:57:37Z hello
By default, the workflow status command will show the state of all workflows across all running contexts.
To show only the status of workflow instances of a specific workflow you can use:
agc workflow status -n <workflow-name>
To show only the status of workflow instances in a specific context you can use:
agc workflow status -c <context-name>
If you want to check the status of a specific workflow you can do so by referencing the workflow execution by its run id:
agc workflow status -r <workflow-run-id>
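The status output is whitespace-delimited, so a script can pull out the run state, for example to poll a run until it finishes. The sketch below parses the sample status line shown earlier; the field position is an assumption based on that sample:

```shell
# Extract the run state (fifth whitespace-delimited field) from a status line;
# field position assumed from the sample `agc workflow status` output above.
status_line='WORKFLOWINSTANCE myContext 66826672-778e-449d-8f28-2274d5b09f05 true COMPLETE 2021-09-10T21:57:37Z hello'

STATE=$(printf '%s\n' "$status_line" | awk '{print $5}')
echo "$STATE"
```

A polling loop would wrap `agc workflow status -r <workflow-run-id>` in the same awk filter and sleep between checks until the state reaches a terminal value such as COMPLETE.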
If you need to stop a running workflow instance, run:
agc workflow stop <workflow-run-id>
Using workflow inputs
You can provide runtime inputs to workflows at the command line. For example, the demo-wdl-project has a workflow named read that requires reading a data file.
The definition of read looks like:
version 1.0
workflow ReadFile {
input {
File input_file
}
call read_file { input: input_file = input_file }
}
task read_file {
input {
File input_file
}
String content = read_string(input_file)
command {
echo '~{content}'
}
runtime {
docker: "ubuntu:latest"
memory: "4G"
}
output { String out = read_string( stdout() ) }
}
You can create an input file locally for this workflow:
mkdir inputs
echo "this is some data" > inputs/data.txt
cat << EOF > inputs/read.inputs.json
{"ReadFile.input_file": "data.txt"}
EOF
Finally, you would submit the workflow with its corresponding inputs file with:
agc workflow run read --inputsFile inputs/read.inputs.json
Amazon Genomics CLI will scan the file provided to --inputsFile for local paths, sync those files to S3, and rewrite the inputs file in transit to point to the appropriate S3 locations. Paths in the *.inputs.json file provided as --inputsFile are referenced relative to the *.inputs.json file.
Accessing workflow results
Workflow results are written to an S3 bucket specified or created by Amazon Genomics CLI during account activation. See the setup instructions above for more details. You can retrieve the S3 URI for the bucket with:
AGC_BUCKET=$(aws ssm get-parameter \
--name /agc/_common/bucket \
--query 'Parameter.Value' \
--output text)
and then use aws s3 commands to explore and retrieve data from the bucket. For example, to list the bucket contents:
aws s3 ls $AGC_BUCKET
You should see something like:
PRE project/
PRE scripts/
Data for multiple projects are kept in project/<project-name> prefixes. Looking into one you should see:
PRE cromwell-execution/
PRE workflow/
The cromwell-execution prefix is specific to the engine Amazon Genomics CLI uses to run WDL workflows. Workflow results will be in cromwell-execution, partitioned by workflow name, workflow run id, and task name. The workflow prefix is where named workflows are cached when you run workflow definitions stored in your local environment.
If a workflow declares workflow outputs, these can be obtained using agc workflow output <run_id>.
The following is example output from the “cram-to-bam” workflow:
OUTPUT id aaba95e8-7512-48c3-9a61-1fd837ff6099
OUTPUT outputs.CramToBamFlow.outputBai s3://agc-123456789012-us-east-1/project/GATK/userid/mrschre4GqyMA/context/spotCtx/cromwell-execution/CramToBamFlow/aaba95e8-7512-48c3-9a61-1fd837ff6099/call-CramToBamTask/NA12878.bai
OUTPUT outputs.CramToBamFlow.outputBam s3://agc-123456789012-us-east-1/project/GATK/userid/mrschre4GqyMA/context/spotCtx/cromwell-execution/CramToBamFlow/aaba95e8-7512-48c3-9a61-1fd837ff6099/call-CramToBamTask/NA12878.bam
OUTPUT outputs.CramToBamFlow.validation_report s3://agc-123456789012-us-east-1/project/GATK/userid/mrschre4GqyMA/context/spotCtx/cromwell-execution/CramToBamFlow/aaba95e8-7512-48c3-9a61-1fd837ff6099/call-ValidateSamFile/NA12878.validation_report
Accessing workflow logs
You can get a summary of the log information for a workflow as follows:
agc logs workflow <workflow-name>
This will return the logs for all runs of the workflow. If you just want the logs for a specific workflow run, you can use:
agc logs workflow <workflow-name> -r <workflow-instance-id>
This will print out the stdout generated by each workflow task.
For the hello workflow above, this would look like:
Fri, 10 Sep 2021 22:00:04 +0000 download: s3://agc-123456789012-us-east-2/scripts/3e129a27928c192b7804501aabdfc29e to tmp/tmp.BqCb2iaae/batch-file-temp
Fri, 10 Sep 2021 22:00:04 +0000 *** LOCALIZING INPUTS ***
Fri, 10 Sep 2021 22:00:05 +0000 download: s3://agc-123456789012-us-east-2/project/Demo/userid/xxxxxxxxJKP3z/context/myContext/cromwell-execution/hello_agc/66826672-778e-449d-8f28-2274d5b09f05/call-hello/script to agc-123456789012-us-east-2/project/Demo/userid/xxxxxxxxJKP3z/context/myContext/cromwell-execution/hello_agc/66826672-778e-449d-8f28-2274d5b09f05/call-hello/script
Fri, 10 Sep 2021 22:00:05 +0000 *** COMPLETED LOCALIZATION ***
Fri, 10 Sep 2021 22:00:05 +0000 Hello Amazon Genomics CLI!
Fri, 10 Sep 2021 22:00:05 +0000 *** DELOCALIZING OUTPUTS ***
Fri, 10 Sep 2021 22:00:05 +0000 upload: ./hello-rc.txt to s3://agc-123456789012-us-east-2/project/Demo/userid/xxxxxxxxJKP3z/context/myContext/cromwell-execution/hello_agc/66826672-778e-449d-8f28-2274d5b09f05/call-hello/hello-rc.txt
Fri, 10 Sep 2021 22:00:06 +0000 upload: ./hello-stderr.log to s3://agc-123456789012-us-east-2/project/Demo/userid/xxxxxxxxJKP3z/context/myContext/cromwell-execution/hello_agc/66826672-778e-449d-8f28-2274d5b09f05/call-hello/hello-stderr.log
Fri, 10 Sep 2021 22:00:06 +0000 upload: ./hello-stdout.log to s3://agc-123456789012-us-east-2/project/Demo/userid/xxxxxxxxJKP3z/context/myContext/cromwell-execution/hello_agc/66826672-778e-449d-8f28-2274d5b09f05/call-hello/hello-stdout.log
Fri, 10 Sep 2021 22:00:06 +0000 *** COMPLETED DELOCALIZATION ***
If your workflow fails, useful debug information is typically reported by the workflow engine logs. These are unique per context. To get those for a context named myContext, you would run:
agc logs engine --context myContext
You should get something like:
Fri, 10 Sep 2021 23:40:49 +0000 2021-09-10 23:40:49,421 cromwell-system-akka.dispatchers.api-dispatcher-175 INFO - WDL (1.0) workflow 1473f547-85d8-4402-adfc-e741b7df69f2 submitted
Fri, 10 Sep 2021 23:40:52 +0000 2021-09-10 23:40:52,711 cromwell-system-akka.dispatchers.engine-dispatcher-30 INFO - 1 new workflows fetched by cromid-2054603: 1473f547-85d8-4402-adfc-e741b7df69f2
Fri, 10 Sep 2021 23:40:52 +0000 2021-09-10 23:40:52,712 cromwell-system-akka.dispatchers.engine-dispatcher-14 INFO - WorkflowManagerActor: Starting workflow UUID(1473f547-85d8-4402-adfc-e741b7df69f2)
Fri, 10 Sep 2021 23:40:52 +0000 2021-09-10 23:40:52,712 cromwell-system-akka.dispatchers.engine-dispatcher-14 INFO - WorkflowManagerActor: Successfully started WorkflowActor-1473f547-85d8-4402-adfc-e741b7df69f2
Fri, 10 Sep 2021 23:40:52 +0000 2021-09-10 23:40:52,712 cromwell-system-akka.dispatchers.engine-dispatcher-14 INFO - Retrieved 1 workflows from the WorkflowStoreActor
Fri, 10 Sep 2021 23:40:52 +0000 2021-09-10 23:40:52,716 cromwell-system-akka.dispatchers.engine-dispatcher-14 INFO - MaterializeWorkflowDescriptorActor [UUID(1473f547)]: Parsing workflow as WDL 1.0
Fri, 10 Sep 2021 23:40:52 +0000 2021-09-10 23:40:52,721 cromwell-system-akka.dispatchers.engine-dispatcher-14 INFO - MaterializeWorkflowDescriptorActor [UUID(1473f547)]: Call-to-Backend assignments: hello_agc.hello -> AWSBATCH
Fri, 10 Sep 2021 23:40:52 +0000 2021-09-10 23:40:52,722 WARN - Unrecognized configuration key(s) for AwsBatch: auth, numCreateDefinitionAttempts, filesystems.s3.duplication-strategy, numSubmitAttempts, default-runtime-attributes.scriptBucketName
Fri, 10 Sep 2021 23:40:53 +0000 2021-09-10 23:40:53,741 cromwell-system-akka.dispatchers.engine-dispatcher-14 INFO - WorkflowExecutionActor-1473f547-85d8-4402-adfc-e741b7df69f2 [UUID(1473f547)]: Starting hello_agc.hello
Fri, 10 Sep 2021 23:40:54 +0000 2021-09-10 23:40:54,030 cromwell-system-akka.dispatchers.engine-dispatcher-14 INFO - Assigned new job execution tokens to the following groups: 1473f547: 1
Fri, 10 Sep 2021 23:40:55 +0000 2021-09-10 23:40:55,501 cromwell-system-akka.dispatchers.engine-dispatcher-4 INFO - 1473f547-85d8-4402-adfc-e741b7df69f2-EngineJobExecutionActor-hello_agc.hello:NA:1 [UUID(1473f547)]: Call cache hit process had 0 total hit failures before completing successfully
Fri, 10 Sep 2021 23:40:56 +0000 2021-09-10 23:40:56,842 cromwell-system-akka.dispatchers.engine-dispatcher-31 INFO - WorkflowExecutionActor-1473f547-85d8-4402-adfc-e741b7df69f2 [UUID(1473f547)]: Job results retrieved (CallCached): 'hello_agc.hello' (scatter index: None, attempt 1)
Fri, 10 Sep 2021 23:40:57 +0000 2021-09-10 23:40:57,820 cromwell-system-akka.dispatchers.engine-dispatcher-4 INFO - WorkflowExecutionActor-1473f547-85d8-4402-adfc-e741b7df69f2 [UUID(1473f547)]: Workflow hello_agc complete. Final Outputs:
Fri, 10 Sep 2021 23:40:57 +0000 {
Fri, 10 Sep 2021 23:40:57 +0000 "hello_agc.hello.out": "Hello Amazon Genomics CLI!"
Fri, 10 Sep 2021 23:40:57 +0000 }
Fri, 10 Sep 2021 23:40:59 +0000 2021-09-10 23:40:59,826 cromwell-system-akka.dispatchers.engine-dispatcher-14 INFO - WorkflowManagerActor: Workflow actor for 1473f547-85d8-4402-adfc-e741b7df69f2 completed with status 'Succeeded'. The workflow will be removed from the workflow store.
You can filter logs with the --filter flag. The filter syntax adheres to CloudWatch’s filter and pattern syntax.
For example, the following will give you all error logs from the workflow engine:
agc logs engine --context myContext --filter ERROR
Additional workflow examples
The Amazon Genomics CLI installation also includes a set of typical genomics workflows for raw data processing, germline variant discovery, and joint genotyping based on GATK Best Practices, developed by the Broad Institute. More information on how these workflows work is available in the GATK Workflows Github repository.
You can find these in:
~/agc/examples/gatk-best-practices-project
These workflows come pre-packaged with MANIFEST.json files that specify example input data available publicly in the AWS Registry of Open Data.
Note: these workflows take between 5 minutes and ~3 hours to complete.
Cleanup
When you are done running workflows, it is recommended you stop all cloud resources to save costs.
Stop a context with:
agc context destroy <context-name>
This will destroy all compute resources in a context, but retain any data in S3. If you want to destroy all your running contexts at once, you can use:
agc context destroy --all
Note, you will not be able to destroy a context that has a running workflow. Workflows will need to complete on their own or be stopped before you can destroy the context.
If you want to stop using Amazon Genomics CLI in your AWS account entirely, you need to deactivate it:
agc account deactivate
This will remove Amazon Genomics CLI’s core infrastructure. If Amazon Genomics CLI created a VPC as part of the activate process, it will be removed. If Amazon Genomics CLI created an S3 bucket for you, it will be retained.
To uninstall Amazon Genomics CLI from your local machine, run the following command:
./agc/uninstall.sh
Note: uninstalling the CLI will not remove any resources or persistent data from your AWS account.