Projects
An Amazon Genomics CLI project defines the projects, contexts, data and workflows that make up a genomics analysis. Each project is defined
in a project file named agc-project.yaml
.
Project File Location
To find the project definition, Amazon Genomics CLI will look for a file named agc-project.yaml
in the current working directory. If
the file is not found, Amazon Genomics CLI will traverse up the file hierarchy until the file is found or until the root of the file
system is reached. If no project definition can be found an error will be reported. All Amazon Genomics CLI commands operate on the project identified by the above process.
Consider the example directory structure below:
/
├── baa/
│ ├── a/
│ └── agc-project.yaml
├── foo/
└── foz/
└── a/
├── agc-project.yaml
└── b/
└── c/
└── agc-project.yaml
- If the current working directory is
/baa
or/baa/a
then/baa/agc-project.yaml
will be used for definitions, - If the current working directory is
/foo
an error will be reported as no project file is found before the root, - If the current working directory is
/foz
an error will be reported as no project file is found before the root, - If the current working directory is
/foz/a
or/foz/a/b
then/foz/a/agc-project.yaml
will be used for definitions. - If the current working directory is
/foz/a/b/c
then/foz/a/b/c/agc-project.yaml
will be used for definitions.
Relative Locations
The location of resources declared in a project file are resolved relative to the location of the project file unless
they are declared using an absolute path. If the project file in /baa
declared that
there was a workflow definition in a/b/
then Amazon Genomics CLI will search for that definition in /baa/a/b/
.
Project File Structure
A minimal project file can be generated using the agc project init myProject --workflow-type nextflow
. Using myProject
as a project name and workflow type nextflow
will result in the following:
name: myProject
schemaVersion: 1
contexts:
ctx1:
engines:
- type: nextflow
engine: nextflow
This is fully usable project called “myProject” with a single context named “ctx1”. At this point “ctx1” can be deployed however, there are currently no workflows defined.
name
A string that identifies the project
schemaVersion
An integer defining the schema version. Version numbers will be incremented when changes are made to the project schema that are not backward compatible.
contexts
A map of context names to context definitions. Each context in the project must have a unique name. The contexts documentation provides more details.
workflows
A map of workflow names to workflow definitions. Workflow names must be unique in a project. The workflows documentation provides more details.
data
An array of data sources that the contexts of the project have access to. For example:
data:
- location: s3://gatk-test-data
readOnly: true
- location: s3://broad-references
readOnly: true
- location: s3://1000genomes-dragen-3.7.6
readOnly: true
You can use S3 prefixes to be more restrictive about access to data. For example, if you want to allow access to the
foo
folder of my-bucket
and it’s sub-folders you would declare the location as:
data:
- location: s3://my-bucket/foo/*
You can also grant access to a specific object (only) by providing the full path of the object. For example:
data:
- location: s3://my-bucket/foo/object
Commands
A full reference of project commands are available here
init
The agc project init <project-name> --workflow-type <worklow-type>
command can be used to initialize a minimal agc-project.yaml
file in the current
directory. Alternatively project yaml files can be created with any text editor.
describe
The agc project describe <project-name>
command will provide basic metadata about the ‘local’ project file. See
above for details on how project files are located.
validate
Using agc project validate
you can quickly identify any syntax errors in your local project file.
Versioning and Sharing
We recommend placing a project under source version control using a tool like Git. The folder containing the agc-project.yaml
file is a natural location for the root of a Git repository. Workflows relating to the project would naturally be located
in sub-folders of the same repository allowing those to be versioned as well. Alternatively, more advanced Git users may
consider storing workflows as a Git sub-module allowing them to
be independent of the project and reused among projects.
Projects and associated workflows can then be shared by “pushing” the project’s Git repository to a website such as GitHub, GitLab, or BitBucket or hosted on a private Git Server like AWS Code Commit. To facilitate sharing you should ensure that any file paths in your definitions are relative to the project and not absolute. You will also need to make sure that data locations are appropriately shared.
Costs
A project itself doesn’t have infrastructure. It is not deployed and therefore has no direct costs. If the contexts defined by an infrastructure are deployed or the workflows run then those will incur costs.
Tags
The project name
will be tagged
on any deployed contexts or workflows defined in this project allowing costs to be aggregated to the project level.
Technical Details
A project is purely a YAML definition. The values in the agc-project.yaml
file are used by CDK when Amazon Genomics CLI deploys contexts
and when Amazon Genomics CLI runs workflows. The project itself has no direct infrastructure. The project name
is used to help namespace
context infrastructure.